Python Pandas Multiprocessing Apply - python

I am wondering if there is a way to run a pandas DataFrame apply in parallel. I have looked around and haven't found anything. In theory it should be fairly simple to implement, since a row-wise apply is practically the textbook definition of a parallel task. Has anyone tried this, or does anyone know of a way? If no one has any ideas, I might just try writing it myself.
The code I am working with is below. Sorry for the lack of import statements; they are mixed in with a lot of other things.
def apply_extract_entities(row):
    names=[]
    counter=0
    print row
    for sent in nltk.sent_tokenize(open(row['file_name'], "r+b").read()):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            if hasattr(chunk, 'node'):
                names+= [chunk.node, ' '.join(c[0] for c in chunk.leaves())]
                counter+=1
    print counter
    return names
data9_2['proper_nouns']=data9_2.apply(apply_extract_entities, axis=1)
EDIT:
So here is what I tried. I ran it on just the first five elements of my iterable, and it takes longer than running it serially, so I assume it is not working.
os.chdir(str(home))
data9_2=pd.read_csv('edgarsdc3.csv')
os.chdir(str(home)+str('//defmtest'))
#import stuff
from nltk import pos_tag, ne_chunk
from nltk.tokenize import SpaceTokenizer
#define apply function and apply it
os.chdir(str(home)+str('//defmtest'))
####
#this is our apply function
def apply_extract_entities(row):
    names=[]
    counter=0
    print row
    for sent in nltk.sent_tokenize(open(row['file_name'], "r+b").read()):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            if hasattr(chunk, 'node'):
                names+= [chunk.node, ' '.join(c[0] for c in chunk.leaves())]
                counter+=1
    print counter
    return names
#need something that populates a list of sections of a dataframe
def dataframe_splitter(df):
    df_list=range(len(df))
    for i in xrange(len(df)):
        sliced=df.ix[i]
        df_list[i]=sliced
    return df_list
df_list=dataframe_splitter(data9_2)
#df_list=range(len(data9_2))
print df_list
#the multiprocessing section
import multiprocessing
def worker(arg):
    print arg
    (arg)['proper_nouns']=arg.apply(apply_extract_entities, axis=1)
    return arg
pool = multiprocessing.Pool(processes=10)
# get list of pieces
res = pool.imap_unordered(worker, df_list[:5])
res2= list(itertools.chain(*res))
pool.close()
pool.join()
# re-assemble pieces into the final output
output = data9_2.head(1).concatenate(res)
print output.head()

With multiprocessing, it's best to generate several large blocks of data, then re-assemble them to produce the final output.
source
import multiprocessing
def worker(arg):
    return arg*2
pool = multiprocessing.Pool()
# get list of pieces
res = pool.map(worker, [1,2,3])
pool.close()
pool.join()
# re-assemble pieces into the final output
output = sum(res)
print 'got:',output
output
got: 12
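Applied to a DataFrame, the "several large blocks, then re-assemble" advice could look roughly like the sketch below. It is written for Python 3 and is only an illustration: row_func, apply_chunk, split_frame and parallel_apply are hypothetical names, and row_func stands in for the expensive per-row function from the question.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def row_func(row):
    # Hypothetical stand-in for an expensive per-row function.
    return row["a"] + row["b"]


def apply_chunk(chunk):
    # Runs in a worker process: an ordinary row-wise apply on one block.
    return chunk.apply(row_func, axis=1)


def split_frame(df, n_chunks):
    # Cut the frame into at most n_chunks contiguous blocks of rows.
    n_chunks = max(1, min(n_chunks, len(df)))
    bounds = np.linspace(0, len(df), n_chunks + 1).astype(int)
    return [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]


def parallel_apply(df, n_workers=4):
    chunks = split_frame(df, n_workers)
    with mp.Pool(processes=n_workers) as pool:
        parts = pool.map(apply_chunk, chunks)
    # pool.map preserves input order, so concat restores the original row order.
    return pd.concat(parts)


if __name__ == "__main__":
    frame = pd.DataFrame({"a": range(8), "b": range(8)})
    frame["total"] = parallel_apply(frame, n_workers=2)
    print(frame)
```

Passing a few large blocks instead of single rows keeps the pickling and process start-up overhead small relative to the work done, which is probably why the one-row-per-task attempt in the question was slower than the serial version.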

Related

Python concurrent.futures global variables

I have multiprocessing code in which each process has to analyse the same data differently.
The input data is always the same and is never modified.
The input data is a data frame with 20 columns and 60k rows.
How can I efficiently pass this data to each process?
In a single-process application I used a global variable, but that does not work with multiprocessing.
When I try to pass it as a function argument, I only get the first element of the table.
Welcome to Stack Overflow. Please take the time to give a minimal reproducible example; that gets you specific answers and helps the community in general.
Anyway, you shouldn't use global variables if you need to change them with each iteration/process/etc.
In rough, easily digestible terms, multiprocessing works like this:
import concurrent.futures
import glob
def manipulate_data_function(data):
    # torture_data is a placeholder for the actual processing step
    result = torture_data(data)
    return result
# ThreadPoolExecutor suits I/O-bound work; use ProcessPoolExecutor for CPU bound stuff
with concurrent.futures.ThreadPoolExecutor(max_workers = None) as executor:
    futures = []
    for file in glob.glob('*txt'):
        futures.append(executor.submit(manipulate_data_function, file))
Thank you for the answer. I don't change this data on each iteration. I use the same data in every process; how the data should be processed is passed through the function argument.
with concurrent.futures.ProcessPoolExecutor() as executor:
    res = executor.map(goal_fcn, p)
    for f in concurrent.futures.as_completed(res):
        fp = res
and then
def goal_fcn(x):
    return heavy_calculation(x, global_DataFrame, global_String)
EDIT:
It works with:
with concurrent.futures.ProcessPoolExecutor() as executor:
    res = executor.map(goal_fcn, p, [global_DataFrame], [global_String])
    for f in concurrent.futures.as_completed(res):
        fp = res
def goal_fcn(x, DataFrame, String):
    return heavy_calculation(x, DataFrame, String)
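One caveat about the last call: Executor.map iterates its iterables together and stops at the shortest one, so passing [global_DataFrame] and [global_String] as one-element lists would normally yield a single call, however long p is. A possible way around this is to bind the constant arguments with functools.partial, as in the sketch below. heavy_calculation here is a hypothetical stand-in and run_all is an illustrative name, not something from the original post.

```python
import concurrent.futures
from functools import partial


def heavy_calculation(x, table, label):
    # Hypothetical stand-in: reads the shared data, never modifies it.
    return label, x * sum(table)


def run_all(params, table, label):
    # partial binds the constant arguments, so map only iterates over params.
    task = partial(heavy_calculation, table=table, label=label)
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # executor.map yields plain results in input order, not futures,
        # so as_completed is not needed here.
        return list(executor.map(task, params))


if __name__ == "__main__":
    print(run_all([1, 2, 3], table=[10, 20, 30], label="demo"))
```

The bound data is still pickled and sent to the workers together with the tasks, so for a very large table this is a matter of correctness rather than memory savings.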

Running different function parallel in python3.7 using multiprocessing

I want to run two different functions in parallel in Python. I have used the code below:
import re
import pandas as pd
def remove_special_char(data):
    # Change cell values which start with '=' sign leading to Excel formula issues
    data['Description'] = data['Description'].apply(lambda val: re.sub(r'^=', "'=", str(val)))
    return(data)
file_path1 = r'.\file1.xlsx'
file_path2 = r'.\file2.xlsx'
def method1(file_path1):
    data = pd.read_excel(file_path1)
    data= remove_special_char(data)
    return data
def method2(file_path2):
    data = pd.read_excel(file_path2)
    data= remove_special_char(data)
    return data
I am using the Pool code below, but it's not working.
from multiprocessing import Pool
p = Pool(3)
result1 = p.map(method1(file_path1), args=file_path1)
result2 = p.map(method2(file_path1), args=file_path2)
I want to run both of these methods in parallel to save execution time and, at the same time, get their return values.
I don't know why you are defining the same method twice with different parameter names, but in any case the map method of Pool takes a function as its first argument and an iterable as its second. map calls the function on each item of the iterable and returns a list with all the results. So what you want is something more like:
from multiprocessing import Pool
# raw strings, so that '\f' is not read as a form-feed escape
file_paths = (r'.\file1.xlsx', r'.\file2.xlsx')
def method(file_path):
    data = pd.read_excel(file_path)
    data= remove_special_char(data)
    return data
with Pool(3) as p:
    result = p.map(method, file_paths)
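If the two functions really are different, one option is Pool.apply_async, which schedules a single call and returns a handle whose get() gives the return value. The sketch below is only an illustration: count_words, shout and run_both are hypothetical names and do not come from the question.

```python
from multiprocessing import Pool


def count_words(text):
    # Hypothetical first task.
    return len(text.split())


def shout(text):
    # Hypothetical second task, unrelated to the first.
    return text.upper()


def run_both(text):
    with Pool(2) as pool:
        # apply_async schedules each call and returns immediately.
        first = pool.apply_async(count_words, (text,))
        second = pool.apply_async(shout, (text,))
        # get() blocks until the corresponding result is ready.
        return first.get(), second.get()


if __name__ == "__main__":
    print(run_both("two different functions"))
```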

Sharing large objects in multiprocessing pools

I'm trying to revisit this slightly older question and see if there's a better answer these days.
I'm using python3 and I'm trying to share a large dataframe with the workers in a pool. My function reads the dataframe, generates a new array using data from the dataframe, and returns that array. Example code below (note: in the example below I do not actually use the dataframe, but in my code I do).
def func(i):
    return i*2
def par_func_dict(mydict):
    values = mydict['values']
    df = mydict['df']
    return pd.Series([func(i) for i in values])
N = 10000
arr = list(range(N))
data_split = np.array_split(arr, 3)
df = pd.DataFrame(np.random.randn(10,10))
pool = Pool(cores)
gen = ({'values' : i, 'df' : df}
       for i in data_split)
data = pd.concat(pool.map(par_func_dict,gen), axis=0)
pool.close()
pool.join()
I'm wondering if there's a way to avoid feeding the generator with copies of the dataframe, so that it doesn't take up so much memory.
The answer to the link above suggests using multiprocessing.Process(), but from what I can tell, it's difficult to use that on top of functions that return things (need to incorporate signals / events), and the comments indicate that each process still ends up using a large amount of memory.
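One commonly used pattern for this, sketched below under assumptions, is to hand the large object to each worker once through the Pool initializer instead of bundling it with every task. The names init_worker, work and run are hypothetical, and a plain list stands in for the DataFrame; a DataFrame could be passed through initargs in the same way.

```python
from multiprocessing import Pool

_shared = None  # set once in every worker by init_worker


def init_worker(big_object):
    # Runs once when each worker process starts.
    global _shared
    _shared = big_object


def work(values):
    # Reads the per-process global instead of receiving the object per task.
    return [v * 2 + len(_shared) for v in values]


def run(big_object, batches, n_workers=3):
    with Pool(n_workers, initializer=init_worker, initargs=(big_object,)) as pool:
        return pool.map(work, batches)


if __name__ == "__main__":
    table = list(range(1000))  # stand-in for a large DataFrame
    print(run(table, [[1, 2], [3, 4], [5, 6]]))
```

This sends the object once per worker rather than once per task. Each worker may still end up holding its own copy, so it reduces the transfer overhead more reliably than the total memory footprint.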

Handle multiple results in Python multiprocessing

I'm writing a piece of Python code to parse a lot of ASCII files using multiprocessing.
For each file I have to perform the operations of this function:
def parse_file(file_name):
    record = False
    path_include = []
    buffer_include = []
    include_file_filters = {}
    include_keylines = {}
    grids_lines = []
    mat_name_lines = []
    pids_name_lines = []
    pids_shell_lines= []
    pids_weld_lines = []
    shells_lines = []
    welds_lines = []
    with open(file_name, 'rb') as in_file:
        for lineID, line in enumerate(in_file):
            if record:
                path_include += line
            if record and re.search(r'[\'|\"]$', line.strip()):
                buffer_include.append(re_path_include.search(
                    path_include).group(1).replace('\n', ''))
                record = False
            if 'INCLUDE' in line and '$' not in line:
                if re_path_include.search(line):
                    buffer_include.append(
                        re_path_include.search(line).group(1))
                else:
                    path_include = line
                    record = True
            if line.startswith('GRID'):
                grids_lines += [lineID]
            if line.startswith('$HMNAME MAT'):
                mat_name_lines += [lineID]
            if line.startswith('$HMNAME PROP'):
                pids_name_lines += [lineID]
            if line.startswith('PSHELL'):
                pids_shell_lines += [lineID]
            if line.startswith('PWELD'):
                pids_weld_lines += [lineID]
            if line.startswith(('CTRIA3', 'CQUAD4')):
                shells_lines += [lineID]
            if line.startswith('CWELD'):
                welds_lines += [lineID]
    include_keylines = {'grid': grids_lines, 'mat_name': mat_name_lines, 'pid_name': pids_name_lines, \
                        'pid_shell': pids_shell_lines, 'pid_weld': pids_weld_lines, 'shell': shells_lines, 'weld': welds_lines}
    include_file_filters = {file_name: include_keylines}
    return buffer_include, include_file_filters
This function is used in a loop over a list of files, in this way (each process parses one entire file):
import multiprocessing as mp
p = mp.Pool(mp.cpu_count())
buffer_include = []
include_file_filters = {}
for include in grouper([list_of_file_path]):
    current = mp.current_process()
    print 'Running: ', current.name, current._identity
    results = p.map(parse_file, include)
    buffer_include += results[0]
    include_file_filters.update(results[1])
p.close()
The grouper function used above is defined as
def grouper(iterable, padvalue=None):
    return itertools.izip_longest(*[iter(iterable)]*mp.cpu_count(), fillvalue=padvalue)
I'm using Python 2.7.15 on a CPU with 4 cores (Intel Core i3-6006U).
When I run my code, I see all the CPUs engaged at 100% and the output Running: MainProcess () in the Python console, but nothing else happens. It seems that my code is blocked at the instruction results = p.map(parse_file, include) and can't proceed (the code works well when I parse the files one at a time without parallelization).
What is wrong?
How can I deal with the results returned by the parse_file function during parallel execution? Is my approach correct or not?
Thanks in advance for your support.
EDIT
Thanks darc for your reply. I've tried your suggestion but the issue is the same. The problem seems to be overcome if I put the code under an if statement like so:
if __name__ == '__main__':
Maybe this is due to the way Python IDLE handles processes. I'm using the IDLE environment for development and debugging.
According to the Python docs:
map(func, iterable[, chunksize])
A parallel equivalent of the map() built-in function (it supports only one iterable argument though). It blocks until the result is ready.
This method chops the iterable into a number of chunks which it submits to the process pool as separate tasks. The (approximate) size of these chunks can be specified by setting chunksize to a positive integer.
Since map blocks, your process waits until parse_file is done.
Since map already chunks the iterable, you can try sending all of the includes together as one large iterable. map returns one (buffer_include, include_file_filters) tuple per file, so the results are merged in a loop:
import multiprocessing as mp
p = mp.Pool(mp.cpu_count())
buffer_include = []
include_file_filters = {}
results = p.map(parse_file, list_of_file_path, 1)
for file_buffer, file_filters in results:
    buffer_include += file_buffer
    include_file_filters.update(file_filters)
p.close()
If you want to keep the original loop, use apply_async; or, if you are using Python 3, you can use the ProcessPoolExecutor submit() function and read the results.
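For the Python 3 route mentioned above, a minimal sketch with ProcessPoolExecutor.submit could look like this. parse_text is a hypothetical, much simplified stand-in for parse_file that takes the text directly so that the example is self-contained; merge and parse_all are illustrative names as well.

```python
import concurrent.futures


def parse_text(name, text):
    # Hypothetical stand-in for parse_file: returns a list and a dict per input.
    lines = text.splitlines()
    includes = [line for line in lines if line.startswith("INCLUDE")]
    grid_lines = [i for i, line in enumerate(lines) if line.startswith("GRID")]
    return includes, {name: {"grid": grid_lines}}


def merge(results):
    # Fold an iterable of (list, dict) pairs into one list and one dict.
    all_includes, all_filters = [], {}
    for includes, filters in results:
        all_includes += includes
        all_filters.update(filters)
    return all_includes, all_filters


def parse_all(named_texts):
    with concurrent.futures.ProcessPoolExecutor() as executor:
        futures = [executor.submit(parse_text, n, t) for n, t in named_texts]
        # as_completed yields futures as they finish, in no particular order.
        done = concurrent.futures.as_completed(futures)
        return merge(f.result() for f in done)


if __name__ == "__main__":
    demo = [("a.dat", "INCLUDE 'x'\nGRID 1"), ("b.dat", "GRID 2\nGRID 3")]
    print(parse_all(demo))
```

Because as_completed does not preserve submission order, the merged list may come out in a different order from run to run; keying the dict by file name, as parse_file already does, avoids depending on that order.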

Parallel function call in python

I am quite new to Python. I have been thinking of parallelizing the code below, where a list of doj values is formatted with the help of a lambda:
m_df[['doj']] = m_df[['doj']].apply(lambda x: formatdoj(*x), axis=1)
def formatdoj(doj):
    doj = str(doj).split(" ")[0]
    doj = datetime.strptime(doj, '%Y' + "-" + '%m' + "-" + "%d")
    return doj
Since the list has a million records, formatting all of them takes a lot of time.
How can I make parallel function calls in Python, similar to Parallel.ForEach in C#?
I think that in your case using parallel computation is a bit of an overkill. The slowness comes from the code, not from using a single processor. I'll show you in some steps how to make it faster, guessing a bit that you're working with a Pandas dataframe and what your dataframe contains (please stick to SO guidelines and include a complete working example!!)
For my test, I've used the following random dataframe with 100k rows (scale times up to get to your case):
N=int(1e5)
m_df = pd.DataFrame([['{}-{}-{}'.format(y,m,d)]
                     for y,m,d in zip(np.random.randint(2007,2019,N),
                                      np.random.randint(1,13,N),
                                      np.random.randint(1,28,N))],
                    columns=['doj'])
Now this is your code:
tstart = time()
m_df[['doj']] = m_df[['doj']].apply(lambda x: formatdoj(*x), axis=1)
print("Done in {:.3f}s".format(time()-tstart))
On my machine it runs in around 5.1s. It has several problems. The first one is you're using dataframes instead of series, although you work only on one column, and creating a useless lambda function. Simply doing:
m_df['doj'].apply(formatdoj)
This cuts the time down to 1.6s. Also, joining strings with '+' is slow in Python; you can change your formatdoj to:
def faster_formatdoj(doj):
    return datetime.strptime(doj.split()[0], '%Y-%m-%d')
m_df['doj'] = m_df['doj'].apply(faster_formatdoj)
This is not a great improvement, but it does cut the time down a bit, to 1.5s. If you really need to join the strings (e.g. because they are not fixed), use '-'.join(['%Y','%m','%d']) instead, which is faster.
But the true bottleneck comes from using datetime.strptime a lot of times. It is intrinsically a slow command - dates are a bulky thing. On the other hand, if you have millions of dates, and assuming they're not uniformly spread since the beginning of humankind, chances are they are massively duplicated. So the following is how you should truly do it:
tstart = time()
# Create a new column with only the first word
m_df['doj_split'] = m_df['doj'].apply(lambda x: x.split()[0])
converter = {
    x: faster_formatdoj(x) for x in m_df['doj_split'].unique()
}
m_df['doj'] = m_df['doj_split'].apply(lambda x: converter[x])
# Drop the column we added
m_df.drop(['doj_split'], axis=1, inplace=True)
print("Done in {:.3f}s".format(time()-tstart))
This works in around 0.2/0.3s, more than 10 times faster than your original implementation.
After all this, if it still runs too slowly, you can consider working in parallel (preferably parallelizing the first "split" instruction separately and, maybe, the apply-lambda part; otherwise you'd be creating many different "converter" dictionaries, nullifying the gain). But I'd take that as a last step rather than the first solution...
[EDIT]: Originally in the first step of the last code box I used m_df['doj_split'] = m_df['doj'].str.split().apply(lambda x: x[0]) which is functionally equivalent but a bit slower than m_df['doj_split'] = m_df['doj'].apply(lambda x: x.split()[0]). I'm not entirely sure why, probably because it's essentially applying two functions instead of one.
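The "parse every distinct date only once" idea does not depend on pandas. As a rough illustration using a different mechanism from the converter dictionary above, the standard library's functools.lru_cache can memoise the parser; parse_day is a hypothetical name, and the sample dates are made up for the example.

```python
from datetime import datetime
from functools import lru_cache


@lru_cache(maxsize=None)
def parse_day(text):
    # Each distinct date string is parsed once; repeats are served from the cache.
    return datetime.strptime(text.split()[0], "%Y-%m-%d")


# Heavily duplicated input, as assumed in the answer above.
dates = ["2018-03-07", "2017-12-01"] * 1000
parsed = [parse_day(d) for d in dates]

# cache_info() reports how many calls actually ran strptime (misses).
print(parse_day.cache_info())
```

The function could be passed to m_df['doj'].apply in place of faster_formatdoj; whether it beats the explicit dictionary would need to be timed on the real data.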
Your best bet is to use dask. Dask has a DataFrame type that you can use to create a similar dataframe; when you call the compute function, you can specify the number of cores with the num_workers argument, which parallelizes the task.
Since I'm not sure about your example, I will give you another one using the multiprocessing library:
# -*- coding: utf-8 -*-
import multiprocessing as mp
input_list = ["str1", "str2", "str3", "str4"]
def format_str(str_input):
    str_output = str_input + "_test"
    return str_output
if __name__ == '__main__':
    with mp.Pool(processes = 2) as p:
        result = p.map(format_str, input_list)
    print (result)
Now, let's say you want to map a function with several arguments; you should then use starmap():
# -*- coding: utf-8 -*-
import multiprocessing as mp
input_list = ["str1", "str2", "str3", "str4"]
def format_str(str_input, i):
    str_output = str_input + "_test" + str(i)
    return str_output
if __name__ == '__main__':
    with mp.Pool(processes = 2) as p:
        # each tuple is unpacked into the (str_input, i) arguments
        result = p.starmap(format_str, [(s, i) for i, s in enumerate(input_list)])
    print (result)
Do not forget to place the Pool inside the if __name__ == '__main__': block. Also note that multiprocessing may not work properly inside some IDEs such as Spyder, in which case you'll need to run the script from the command line.
To keep the results visible, you can either save them to a file, or keep the console open at the end with os.system("pause") (Windows) or an input() on Linux.
It's a fairly simple way to use multiprocessing with Python.
