Multiprocessing starmap to a particular function in Python code

How do I add starmap functionality to a function in Python code?
I was trying to use starmap for parallel processing of one function in my Python code. I have added this function:
import multiprocessing as mp

def process_files_with_multiprocessing(inputs, env):
    with mp.Pool(processes=int(env["CPU_COUNT"])) as pool:
        results = pool.starmap(process_files, inputs)
    return results
Where the actual function for parallel processing is as follows (the entire function is not shown):
def process_files():
    """
    The process_files function separates filenames according to the date formats in their names
    and stores them in respective lists. Filenames with date ranges in their names are expanded
    for initial date and final date of receiving.
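For reference, pool.starmap expects an iterable of argument tuples, and the target function must accept matching positional parameters. A minimal sketch of that pattern (the worker function and its arguments are illustrative, not the real process_files):

import multiprocessing as mp

def process_pair(filename, date_format):
    # Illustrative worker: one call per argument tuple
    return filename + " parsed with " + date_format

if __name__ == "__main__":
    inputs = [("a_20200101.csv", "%Y%m%d"), ("b_2020-01-02.csv", "%Y-%m-%d")]
    with mp.Pool(processes=2) as pool:
        # Each tuple in inputs is unpacked into process_pair's two parameters
        results = pool.starmap(process_pair, inputs)
    print(results)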

Related

Global variables not recognized in lambda functions in Pyspark

I am working in Pyspark with a lambda function like the following:
udf_func = UserDefinedFunction(lambda value: method1(value, dict_global), IntegerType())
result_col = udf_func(df[atr1])
The implementation of method1 is as follows:
def method1(value, dict_global):
    result = len(dict_global)
    if value in dict_global:
        result = dict_global[value]
    return result
'dict_global' is a global dictionary that contains some values.
The problem is that when I execute the lambda function the result is always None. For some reason the 'method1' function doesn't see the variable 'dict_global' as an external variable. Why? What could I do?
I finally found a solution; I write it below:
Lambda functions (as well as map and reduce functions) executed in Spark are scheduled among the different executors and run in different execution threads. So the problem in my code was that global variables are sometimes not captured by the functions executed in parallel in different threads, so I looked for a way to solve it.
Fortunately, Spark has an element called "Broadcast" which allows passing variables to a function distributed among the executors, so they can work with them without problems. There are two types of shareable variables: broadcast variables (immutable, read-only) and accumulators (mutable, but only numeric values are accepted).
I rewrote my code to show you how I fixed the problem:
broadcastVar = sc.broadcast(dict_global)
udf_func = UserDefinedFunction(lambda value: method1(value, broadcastVar), IntegerType())
result_col = udf_func(df[atr1])
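A minimal self-contained sketch of this broadcast pattern, assuming a local Spark session (the example DataFrame and dictionary are illustrative, and udf from pyspark.sql.functions is used in place of UserDefinedFunction); note that the worker reads the dictionary through the broadcast variable's .value attribute:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

dict_global = {"a": 1, "b": 2}             # example lookup table
broadcastVar = sc.broadcast(dict_global)   # shipped once to every executor

def method1(value, bc):
    d = bc.value                           # read the broadcast dictionary
    return d[value] if value in d else len(d)

udf_func = udf(lambda value: method1(value, broadcastVar), IntegerType())

df = spark.createDataFrame([("a",), ("z",)], ["atr1"])
df.withColumn("mapped", udf_func(df["atr1"])).show()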
Hope it helps!

Access Shared DataFrame in Multiprocessing Map

I'm trying to speed up some multiprocessing code in Python 3. I have a big read-only DataFrame and a function that makes some calculations based on the values read.
I tried to solve the issue by writing the function in the same file and sharing the big DataFrame as a global, as you can see here. This approach does not allow moving the process function to another file/module, and it's a bit awkward to access a variable outside the scope of the function.
import pandas as pd
import multiprocessing

def process(user):
    # Locate all the user sessions in the *global* sessions dataframe
    user_session = sessions.loc[sessions['user_id'] == user]
    user_session_data = pd.Series()
    # Make calculations and append to user_session_data
    return user_session_data

# The DataFrame users contains the ID and other info for each user
users = pd.read_csv('users.csv')

# Each row is the details of one user action.
# There are several rows with the same user ID
sessions = pd.read_csv('sessions.csv')

p = multiprocessing.Pool(4)
sessions_id = sessions['user_id'].unique()

# I'm passing an integer ID argument to the process() function so
# there is no copy of the big sessions DataFrame
result = p.map(process, sessions_id)
Things I've tried:
Passing a DataFrame instead of integer ID arguments to avoid the sessions.loc... line of code. This approach slows the script down a lot.
Also, I've looked at How to share pandas DataFrame object between processes? but didn't find a better way.
You can try defining process as:
def process(sessions, user):
    ...
And put it wherever you prefer.
Then, when you call p.map, you can use functools.partial, which allows you to specify arguments incrementally:
from functools import partial
...
p.map(partial(process, sessions), sessions_id)
This should not slow the processing down too much and should address your issue.
Note that you could do the same without partial as well, using:
p.map(lambda id: process(sessions, id), sessions_id)
(though the standard pickle used by multiprocessing cannot serialize lambdas, so the partial version is the safer choice).
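Here is a minimal runnable sketch of the partial approach, with a small synthetic sessions DataFrame standing in for the CSV data and an illustrative per-user calculation:

import multiprocessing
from functools import partial

import pandas as pd

def process(sessions, user):
    # Locate all rows for this user and do some per-user calculation
    user_session = sessions.loc[sessions['user_id'] == user]
    return int(user_session['duration'].sum())

if __name__ == '__main__':
    # Synthetic stand-in for sessions.csv
    sessions = pd.DataFrame({
        'user_id': [1, 1, 2, 3, 3, 3],
        'duration': [10, 20, 5, 7, 8, 9],
    })
    sessions_id = sessions['user_id'].unique()

    with multiprocessing.Pool(4) as p:
        # partial freezes the sessions argument; each id becomes process(sessions, id)
        result = p.map(partial(process, sessions), sessions_id)

    print(result)  # per-user totals, e.g. [30, 5, 24]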

Call a user defined function given its name as a string input to another function Python 2.7

I'm new to Python and using Anaconda (editor: Spyder) to write some simple functions. I've created a collection of 20 functions and saved them in separate .py files (file names are the same as function names).
For example
def func1(X):
    Y = ...
    return Y
I have another function that takes as input a function name as string (one of those 20 functions), calls it, does some calculations and return the output.
def Main(String, X):
    Z = ...
    W = String(Z)
    V = ...
    return V
How can I choose the function based on string input?
More details:
The Main function calculates the Sobol indices of a given function. I write the Main function. My colleagues write their own functions (each might be more than 500 lines of code) and just want to use Main to get the Sobol indices. I will give Main to other people, so I do NOT know what function Main will be given in the future. I also do not want the user of Main to go through the trouble of making a dictionary.
Functions are objects in Python. This means you can store them in dictionaries. One approach is to dispatch the function calls by storing the names you wish to call as keys and the functions as values.
So for example:
import func1, func2
operation_dispatcher = {
    "func1": getattr(func1, "func1"),
    "func2": getattr(func2, "func2"),
}
def something_calling_funcs(func_name, param):
    """Calls func_name with param"""
    func_to_call = operation_dispatcher.get(func_name, None)
    if func_to_call:
        func_to_call(param)
Now it might be possible to generate the dispatch table more automatically with something like __import__ but there might be a better design in this case (perhaps consider reorganizing your imports).
EDIT: It took me a minute to fully test this because I had to set up a few files. You can potentially do something like this if you have a lot of names to import and don't want to specify each one manually in the dictionary:
import importlib
func_names = ["func1", "func2"]
operation_dispatch = {
    name: getattr(importlib.import_module(name), name)
    for name in func_names
}

# usage
result = operation_dispatch[function_name](param)
Note that this assumes that the function names and module names are the same. This uses importlib to import the module names from the strings provided in func_names here.
You'll just want to have a dictionary of functions, like this:
call = {
    "func1": func1,
    "functionOne": func1,
    "func2": func2,
}
Note that you can have multiple keys for the same function if necessary and the name doesn't need to match the function exactly, as long as the user enters the right key.
Then you can call this function like this:
def Main(String, X):
    Z = ...
    W = call[String](Z)
    V = ...
    return V
Though I recommend catching an error when the user fails to enter a valid key.
def Main(String, X):
    Z = ...
    try:
        W = call[String](Z)
    except KeyError:
        raise NameError(String + " is not a valid function key")
    V = ...
    return V

Lists/Dictionaries from function not returned

I'm trying to write a script in Python using BioPython that reads a FASTA file and generates a list of the raw DNA sequences as entries.
As this code will be used by many other scripts I will be writing, I want the function for this purpose to be in a separate Python file, which I can import at the start of every other script I write. The script containing the function I am currently calling is as follows:
from Bio import SeqIO
def read_fasta(dna):
    genome = []
    for seq_record in SeqIO.parse(dna, "fasta"):
        genome.append(str(seq_record.seq))
    return genome
When I call this function in Python from cmd, the function works and reads the files, generating the list as I wish. However, if I try to access the list genome again, I get a NameError: name 'genome' is not defined traceback.
Can somebody explain why this is happening, even though I have put in the return genome statement? And what can I do to fix this problem?
genome is in the local scope of the function, so it is not visible from the "outside". You should assign the result of the read_fasta function to some variable in order to access its return value. For example:
new_variable = read_fasta("pcr_template.fasta")
This reads as: let new_variable be assigned the result of the function read_fasta called with "pcr_template.fasta" as its argument.
Now the genome (or anything that your function has returned) is accessed simply by accessing new_variable.
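For completeness, here is a minimal sketch of the intended usage across two files (the file names fasta_utils.py and analysis.py are illustrative):

# fasta_utils.py
from Bio import SeqIO

def read_fasta(dna):
    genome = []
    for seq_record in SeqIO.parse(dna, "fasta"):
        genome.append(str(seq_record.seq))
    return genome

# analysis.py
from fasta_utils import read_fasta

genome = read_fasta("pcr_template.fasta")  # returned list, bound to a name in this scope
print(len(genome))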

Perform a for-loop in parallel in Python 3.2 [duplicate]

Possible duplicate: How do I parallelize a simple Python loop?
I'm quite new to Python (using Python 3.2) and I have a question concerning parallelisation. I have a for-loop that I wish to execute in parallel using "multiprocessing" in Python 3.2:
def computation():
    global output
    for x in range(i, j):
        localResult = ...  # perform some computation as a function of i and j
        output.append(localResult)
In total, I want to perform this computation for a range of i=0 to j=100. Thus I want to create a number of processes that each call the function "computation" with a subdomain of the total range. Any ideas of how to do this? Is there a better way than using multiprocessing?
More specifically, I want to perform a domain decomposition, and I have the following code:
from multiprocessing import Pool
class testModule:
    def __init__(self):
        self
    def computation(self, args):
        start, end = args
        print('start: ', start, ' end: ', end)

testMod = testModule()
length = 100
np = 4
p = Pool(processes=np)
p.map(testMod.computation, [(length, startPosition, length//np) for startPosition in range(0, length, length//np)])
I get an error message mentioning PicklingError. Any ideas what could be the problem here?
Joblib is designed specifically to wrap around multiprocessing for the purposes of simple parallel looping. I suggest using that instead of grappling with multiprocessing directly.
The simple case looks something like this:
from joblib import Parallel, delayed
Parallel(n_jobs=2)(delayed(foo)(i**2) for i in range(10)) # n_jobs = number of processes
The syntax is simple once you understand it. We are using generator syntax in which delayed is used to call function foo with its arguments contained in the parentheses that follow.
In your case, you should either rewrite your for loop with generator syntax, or define another function (i.e. 'worker' function) to perform the operations of a single loop iteration and place that into the generator syntax of a call to Parallel.
In the latter case, you would do something like:
Parallel(n_jobs=2)(delayed(foo)(parameters) for x in range(i,j))
where foo is a function you define to handle the body of your for loop. Note that you do not want to append to a list, since Parallel is returning a list anyway.
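As a minimal sketch of that pattern (the worker body, the bounds i and j, and n_jobs are illustrative):

from joblib import Parallel, delayed

def worker(x, i, j):
    # Body of one loop iteration: compute and return localResult
    return x * (j - i)

i, j = 0, 100
# Each iteration becomes one delayed call; Parallel collects the results in order
output = Parallel(n_jobs=2)(delayed(worker)(x, i, j) for x in range(i, j))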
In this case, you probably want to define a simple function to perform the calculation and get localResult.
def getLocalResult(args):
    """ Do whatever you want in this func.
        The point is that it takes x,i,j and
        returns localResult
    """
    x, i, j = args  # unpack args
    return doSomething(x, i, j)
Now in your computation function, you just create a pool of workers and map the local results:
import multiprocessing

def computation(np=4):
    """ np is number of processes to fork """
    p = multiprocessing.Pool(np)
    output = p.map(getLocalResult, [(x, i, j) for x in range(i, j)])
    return output
I've removed the global here because it's unnecessary (globals are usually unnecessary). In your calling routine you should just do output.extend(computation(np=4)) or something similar.
EDIT
Here's a "working" example of your code:
from multiprocessing import Pool

def computation(args):
    length, startPosition, npoints = args
    print(args)

length = 100
np = 4
p = Pool(processes=np)
p.map(computation, [(startPosition, startPosition + length//np, length//np) for startPosition in range(0, length, length//np)])
Note that what you had didn't work because you were using an instance method as your function. multiprocessing starts new processes and sends the information between processes via pickle, therefore, only objects which can be pickled can be used. Note that it really doesn't make sense to use an instance method anyway. Each process is a copy of the parent, so any changes to state which happen in the processes do not propagate back to the parent anyway.
