I have a for loop that checks some binary conditions and writes a file accordingly. The problem is that the conditions are true for many files (sometimes around 1000 files need to be written), so writing them takes a long time (around 10 minutes). I know I can somehow use Python's multiprocessing to put several cores to work.
This is the code that works, but it only uses one core.
for i,n in enumerate(halo_param.strip()):
    mask = var1['halo_id'] == n
    newtbdata = tbdata1[mask]
    hdu = pyfits.BinTableHDU(newtbdata)
    hdu.writeto(('/home/Documments/file_{0}.fits').format(i))
I read that this can be done using Pool from multiprocessing.
if __name__ == '__main__':
    pool = Pool(processes=4)
I would like to know how to do it and use at least 4 of my cores.
Restructure the body of the for loop as a function, and pass that function to Pool.map.
from multiprocessing import Pool
def work(arg):
    # Each task is one (index, halo id) pair produced by enumerate().
    i, n = arg
    mask = var1['halo_id'] == n
    newtbdata = tbdata1[mask]
    hdu = pyfits.BinTableHDU(newtbdata)
    hdu.writeto(('/home/Documments/file_{0}.fits').format(i))
if __name__ == '__main__':
    pool = Pool(processes=4)
    pool.map(work, enumerate(halo_param.strip()))
    pool.close()
    pool.join()
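
Since var1, tbdata1 and halo_param are not shown in the question, here is a minimal self-contained sketch of the same pattern that can be run as-is. It assumes stand-in data: ROWS, write_group and out_dir are hypothetical names, and plain text files take the place of the FITS tables.

```python
import os
import tempfile
from multiprocessing import Pool

# Hypothetical stand-in for the table: (halo_id, value) rows.
ROWS = [(1, "a"), (2, "b"), (1, "c"), (3, "d"), (2, "e")]


def write_group(task):
    """Select the rows of one halo id and write them to their own file."""
    out_dir, i, halo_id = task
    selected = [value for hid, value in ROWS if hid == halo_id]
    path = os.path.join(out_dir, "file_{0}.txt".format(i))
    with open(path, "w") as fh:
        fh.write("\n".join(selected))
    return path


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    halo_ids = [1, 2, 3]
    # One task per output file; every task carries what the worker needs.
    tasks = [(out_dir, i, n) for i, n in enumerate(halo_ids)]
    with Pool(processes=4) as pool:
        paths = pool.map(write_group, tasks)
    print(paths)
```

Whether this is faster than the serial loop depends on whether the time goes into the masking (CPU) or into the disk writes, so it is worth timing both versions.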
Hello, I've been working on a huge CSV file that needs similarity tests. There are 1.16 million rows, and testing the similarity between every pair of rows takes approximately 7 hours. I want to use multiple threads to reduce the time this takes. The function that does the similarity test is:
def similarity():
    for i in range(0, 1000):
        for j in range(i+1, 1000):
            longestSentence = 0
            commonWords = 0
            row1 = dff['Product'].iloc[i]
            row2 = dff['Product'].iloc[j]
            wordsRow1 = row1.split()
            wordsRow2 = row2.split()
            # words that occur in both sentences
            common = list(set(wordsRow1).intersection(wordsRow2))
            if len(wordsRow1) > len(wordsRow2):
                longestSentence = len(wordsRow1)
                commonWords = calculate(common, wordsRow1)
            else:
                longestSentence = len(wordsRow2)
                commonWords = calculate(common, wordsRow2)
            print(i, j, (commonWords / longestSentence) * 100)
def calculate(common, longestRow):  # count occurrences of the common words
    sum = 0
    for word in common:
        sum += longestRow.count(word)
    return sum
I am using ThreadPoolExecutor to do multithreading and the code to do so is:
with ThreadPoolExecutor(max_workers=500) as executor:
    for result in executor.map(similarity()):
        print(result)
But even if I set max_workers to a very large number, the code runs just as slowly. How can I make it run faster? Is there any other way?
I tried the threading library, but it doesn't work because it just starts threads that do the same job over and over again. So if I start 10 threads, it runs the function 10 times to do the same thing. Thanks in advance for any help.
ThreadPoolExecutor will not help much here because thread pools are suited to I/O-bound tasks. If you were making 500 API calls it would work, but for heavy CPU-bound work it does not. Use ProcessPoolExecutor instead, and note that setting max_workers higher than the number of cores will not help either.
Also, your call is incorrect: executor.map(similarity()) runs similarity() once in the main thread and passes its return value to map, instead of passing the function itself together with an iterable of inputs.
But I think you also need to change your algorithm to make this work properly. There is definitely something wrong with your time complexity.
from concurrent.futures import ProcessPoolExecutor
values = [3,4,5,6]
def cube(x):
    print(f'Cube of {x}:{x*x*x}')
    # Return the value so that exe.map yields results instead of None
    return x*x*x
if __name__ == '__main__':
    result = []
    with ProcessPoolExecutor(max_workers=5) as exe:
        # Submit a single call; this returns a Future
        exe.submit(cube,2)
        # Map the function 'cube' over an iterable of inputs
        result = exe.map(cube,values)
    for r in result:
        print(r)
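
Applied to the similarity loop from the question, one way to split the work is to give each process one value of the outer index i. This is only a sketch under assumptions: PRODUCTS is a hypothetical stand-in for dff['Product'], and similarity_row is a name made up here. It keeps the same pairwise comparison, so the quadratic cost remains; it only spreads that cost over several cores.

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical stand-in for dff['Product'].
PRODUCTS = ["red cotton shirt", "blue cotton shirt", "red wool hat", "green hat"]


def similarity_row(i):
    """Compare row i with every later row; return (i, j, percent) tuples."""
    words_i = PRODUCTS[i].split()
    out = []
    for j in range(i + 1, len(PRODUCTS)):
        words_j = PRODUCTS[j].split()
        common = set(words_i) & set(words_j)
        # Same rule as the question: use row i only if it is strictly longer.
        longest = words_i if len(words_i) > len(words_j) else words_j
        shared = sum(longest.count(word) for word in common)
        out.append((i, j, shared / len(longest) * 100))
    return out


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as exe:
        # Pass the function itself plus an iterable of inputs.
        for rows in exe.map(similarity_row, range(len(PRODUCTS))):
            for i, j, pct in rows:
                print(i, j, pct)
```

For a large table, passing a chunksize to exe.map would probably reduce the per-task overhead.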
I'm new to Python and want to learn how to do parallel processing. I saw the following example:
import multiprocessing as mp
import numpy as np
np.random.RandomState(100)
arr = np.random.randint(0, 10, size=[20, 5])
data = arr.tolist()
def howmany_within_range_rowonly(row, minimum=4, maximum=8):
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count = count + 1
    return count
pool = mp.Pool(mp.cpu_count())
results = pool.map(howmany_within_range_rowonly, [row for row in data])
pool.close()
print(results[:10])
but when I run it, this error happened:
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
    if __name__ == '__main__':
        freeze_support()
        ...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
What should I do?
If you move everything that sits at global scope into an if __name__ == "__main__" block, as follows, you should find that your program behaves as you expect:
import multiprocessing as mp
import numpy as np
def howmany_within_range_rowonly(row, minimum=4, maximum=8):
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count = count + 1
    return count
if __name__ == "__main__":
    np.random.RandomState(100)
    arr = np.random.randint(0, 10, size=[20, 5])
    data = arr.tolist()
    pool = mp.Pool(mp.cpu_count())
    results = pool.map(howmany_within_range_rowonly, [row for row in data])
    pool.close()
    print(results[:10])
Without this guard, the multiprocessing code runs whenever the module is imported, not only when it is run as a script. With the spawn start method (the default on Windows, and on macOS since Python 3.8), every child process imports the main module, so an unguarded Pool would be created again inside each child. Python detects this and raises the RuntimeError shown above, hence we protect against the problem.
I ran into the same RuntimeError in a real project when I ran a specific tool on macOS machines (on Linux machines it was fine). However, I'm not sure about the exact cause, because the if __name__ == "__main__" guard seemed to be properly in place.
Following one comment on this Stack Overflow entry, I suspected that the problem was Python >= 3.8, which uses spawn as the default start method for child processes on macOS.
My solution:
Using Python 3.7 did the trick.
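
An alternative to downgrading that may be worth trying is to request a start method explicitly. The sketch below is an assumption-laden illustration, not a confirmed fix for the tool above: pick_start_method and run_squares are hypothetical helper names. Note that fork is not available on Windows, and the Python documentation describes it as unsafe on macOS, so treat it as an experiment.

```python
import multiprocessing as mp


def square(x):
    return x * x


def pick_start_method(preferred="fork"):
    """Return the preferred start method if the platform offers it,
    otherwise fall back to the platform default (first in the list)."""
    available = mp.get_all_start_methods()
    return preferred if preferred in available else available[0]


def run_squares(method, values):
    # A context object behaves like the multiprocessing module itself,
    # but is bound to one specific start method.
    ctx = mp.get_context(method)
    with ctx.Pool(processes=2) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    method = pick_start_method()
    print("platform default:", mp.get_start_method(), "| using:", method)
    print(run_squares(method, range(5)))
```

Printing mp.get_start_method() on both machines is also a quick way to check whether the start method really differs between the macOS and Linux runs.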
Using multiprocessing.Pool I can split an input list for a single function so that it is processed in parallel across multiple CPUs. Like this:
from multiprocessing import Pool
def f(x):
    return x*x
if __name__ == '__main__':
    pool = Pool(processes=4)
    results = pool.map(f, range(100))
    pool.close()
    pool.join()
However, this does not allow running different functions on different processors. Suppose I want to do something like this, in parallel / simultaneously:
foo1(args1) --> Processor1
foo2(args2) --> Processor2
How can this be done?
Edit: After Darkonaut remarks, I do not care about specifically assigning foo1 to Processor number 1. It can be any processor as chosen by the OS. I am just interested in running independent functions in different/ parallel Processes. So rather:
foo1(args1) --> process1
foo2(args2) --> process2
I usually find it easiest to use the concurrent.futures module for concurrency. You can achieve the same with multiprocessing, but concurrent.futures has (IMO) a much nicer interface.
Your example would then be:
from concurrent.futures import ProcessPoolExecutor
def foo1(x):
    return x * x
def foo2(x):
    return x * x * x
if __name__ == '__main__':
    with ProcessPoolExecutor(2) as executor:
        # these return immediately and are executed in parallel, on separate processes
        future_1 = executor.submit(foo1, 1)
        future_2 = executor.submit(foo2, 2)
    # get results / re-raise exceptions that were thrown in workers
    result_1 = future_1.result()  # contains foo1(1)
    result_2 = future_2.result()  # contains foo2(2)
If you have many inputs, it is better to use executor.map with the chunksize argument instead:
from concurrent.futures import ProcessPoolExecutor
def foo1(x):
    return x * x
def foo2(x):
    return x * x * x
if __name__ == '__main__':
    with ProcessPoolExecutor(4) as executor:
        # these return immediately and are executed in parallel, on separate processes
        future_1 = executor.map(foo1, range(10000), chunksize=100)
        future_2 = executor.map(foo2, range(10000), chunksize=100)
    # executor.map returns an iterator which we have to consume to get the results
    result_1 = list(future_1)  # contains [foo1(x) for x in range(10000)]
    result_2 = list(future_2)  # contains [foo2(x) for x in range(10000)]
Note that the optimal values for chunksize and the number of processes, and whether process-based concurrency actually improves performance at all, depend on many factors:
The runtime of foo1 / foo2. If they are extremely cheap (as in this example), the communication overhead between processes might dominate the total runtime.
Spawning a process takes time, so the code inside with ProcessPoolExecutor needs to run long enough for this to amortize.
The actual number of physical processors in the machine you are running on.
Whether your application is IO bound or compute bound.
Whether the functions you use in foo are already parallelized (such as some np.linalg solvers, or scikit-learn estimators).
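
Because of all these factors, measuring is usually more reliable than guessing. Here is a rough timing sketch; timed_map and its executor_cls parameter are hypothetical names introduced for this illustration, and the numbers it prints will differ from machine to machine.

```python
import time
from concurrent.futures import ProcessPoolExecutor


def foo1(x):
    return x * x


def timed_map(chunksize, n=5000, workers=4, executor_cls=ProcessPoolExecutor):
    """Map foo1 over range(n) and return (results, elapsed seconds).

    The elapsed time includes starting and shutting down the workers.
    """
    start = time.perf_counter()
    with executor_cls(workers) as executor:
        results = list(executor.map(foo1, range(n), chunksize=chunksize))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    start = time.perf_counter()
    serial = [foo1(x) for x in range(5000)]
    print("serial      ", time.perf_counter() - start)
    for chunksize in (1, 100, 1000):
        results, elapsed = timed_map(chunksize)
        assert results == serial
        print("chunksize", chunksize, elapsed)
```

For a function as cheap as foo1, the serial loop will most likely win, which is exactly the communication-overhead point made in the list above.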
I am running a multiprocessing pool in a for loop over chunks of data. It runs fine for two iterations and hangs on the third. If I reduce the size of each chunk, it hangs later on, perhaps on the fourth or fifth iteration. In the program where I discovered the problem I am running a more extensive function, but this is enough to reproduce the error.
Is there a proper way to terminate a pool after it has finished, so that I can start it again?
import pandas as pd
import numpy as np
from multiprocess import Pool
df = pd.read_csv('paths.csv')
def do_something(user):
    v = df[df['userId'] == user]
    return v
if __name__ == '__main__':
    users = df['userId'].unique()
    n_chunks = round(len(users)/40)
    subsets = [users[i:i+n_chunks] for i in range(0, len(users), n_chunks)]
    chunk_counter = 0
    for user_subset in subsets:
        chunk_counter += 1
        print(f'Beginning to process chunk {chunk_counter}...')
        with Pool() as pool:
            frames = pool.map(do_something, user_subset)
            pool.close()
            pool.terminate()
        print(f'Completed processing chunk {chunk_counter}.')
I was able to prevent the hanging with the code below:
with Pool(maxtasksperchild=1) as pool:
    frames = pool.map_async(do_something, user_subset).get()
    pool.terminate()
    pool.join()
I don't understand why using map_async would prevent the hanging. I will dive deeper if I have a chance and update if I understand the reason.
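
Another option, which may or may not remove the hang in the real program, is to create the pool once and reuse it for every chunk instead of building and tearing down a pool per iteration. The sketch below uses the standard library's multiprocessing rather than multiprocess, and RECORDS and chunked are hypothetical stand-ins for the DataFrame and the slicing logic.

```python
from multiprocessing import Pool

# Hypothetical stand-in for the DataFrame: (userId, path) records.
RECORDS = [(u, "path_%d" % k) for k, u in enumerate([1, 2, 3, 1, 2, 4, 5, 3])]


def do_something(user):
    """Return all records that belong to one user."""
    return [rec for rec in RECORDS if rec[0] == user]


def chunked(seq, size):
    """Split seq into consecutive slices of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


if __name__ == "__main__":
    users = sorted({u for u, _ in RECORDS})
    # One pool for the whole run; leaving the with block terminates it,
    # so no explicit close()/terminate() calls are needed inside.
    with Pool(processes=2) as pool:
        for n, subset in enumerate(chunked(users, 2), start=1):
            frames = pool.map(do_something, subset)
            print("chunk", n, "->", sum(len(f) for f in frames), "records")
```

Reusing the pool also avoids paying the worker start-up cost once per chunk.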
I have been fiddling with Python's multiprocessing functionality for upwards of an hour now, trying to parallelize a rather complex graph traversal function using multiprocessing.Process and multiprocessing.Manager:
import networkx as nx
import csv
import time
from operator import itemgetter
import os
import multiprocessing as mp
cutoff = 1
exclusionlist = ["cpd:C00024"]
DG = nx.read_gml("KeggComplete.gml", relabel=True)
for exclusion in exclusionlist:
    DG.remove_node(exclusion)
# checks if 'memorizedPaths exists, and if not, creates it
fn = os.path.join(os.path.dirname(__file__),
                  'memorizedPaths' + str(cutoff+1))
if not os.path.exists(fn):
    os.makedirs(fn)
manager = mp.Manager()
memorizedPaths = manager.dict()
filepaths = manager.dict()
degreelist = sorted(DG.degree_iter(),
                    key=itemgetter(1),
                    reverse=True)
def _all_simple_paths_graph(item, DG, cutoff, memorizedPaths, filepaths):
    source = item[0]
    uniqueTreePaths = []
    if cutoff < 1:
        return
    visited = [source]
    stack = [iter(DG[source])]
    while stack:
        children = stack[-1]
        child = next(children, None)
        if child is None:
            stack.pop()
            visited.pop()
        elif child in memorizedPaths:
            for path in memorizedPaths[child]:
                newPath = (tuple(visited) + tuple(path))
                if ((len(newPath) <= cutoff) and
                        (len(set(visited) & set(path)) == 0)):
                    uniqueTreePaths.append(newPath)
            continue
        elif len(visited) < cutoff:
            if child not in visited:
                visited.append(child)
                stack.append(iter(DG[child]))
                if visited not in uniqueTreePaths:
                    uniqueTreePaths.append(tuple(visited))
        else:  # len(visited) == cutoff:
            if ((visited not in uniqueTreePaths) and
                    (child not in visited)):
                uniqueTreePaths.append(tuple(visited + [child]))
            stack.pop()
            visited.pop()
    # writes the absolute path of the node path file into the hash table
    filepaths[source] = str(fn) + "/" + str(source) + "path.txt"
    with open (filepaths[source], "wb") as csvfile2:
        writer = csv.writer(csvfile2, delimiter=" ", quotechar="|")
        for path in uniqueTreePaths:
            writer.writerow(path)
    memorizedPaths[source] = uniqueTreePaths
############################################################################
if __name__ == '__main__':
    start = time.clock()
    for item in degreelist:
        test = mp.Process(target=_all_simple_paths_graph,
                          args=(DG, cutoff, item, memorizedPaths, filepaths))
        test.start()
        test.join()
    end = time.clock()
    print (end-start)
Currently - through luck and magic - it works (sort of). My problem is I'm only using 12 of my 24 cores.
Can someone explain why this might be the case? Perhaps my code isn't the best multiprocessing solution, or is it a feature of my architecture: Intel Xeon CPU E5-2640 @ 2.50GHz x18 running on Ubuntu 13.04 x64?
EDIT:
I managed to get:
p = mp.Pool()
for item in degreelist:
    p.apply_async(_all_simple_paths_graph,
                  args=(DG, cutoff, item, memorizedPaths, filepaths))
p.close()
p.join()
working. However, it's VERY SLOW! So I assume I'm using the wrong function for the job. Hopefully it helps clarify exactly what I'm trying to accomplish!
EDIT2: .map attempt:
partialfunc = partial(_all_simple_paths_graph,
                      DG=DG,
                      cutoff=cutoff,
                      memorizedPaths=memorizedPaths,
                      filepaths=filepaths)
p = mp.Pool()
for item in processList:
    processVar = p.map(partialfunc, xrange(len(processList)))
p.close()
p.join()
It works, but it is slower than single-core. Time to optimize!
Too much piling up here to address in comments, so, where mp is multiprocessing:
mp.cpu_count() should return the number of processors. But test it. Some platforms are funky, and this info isn't always easy to get. Python does the best it can.
If you start 24 processes, they'll do exactly what you tell them to do ;-) Looks like mp.Pool() would be most convenient for you. You pass the number of processes you want to create to its constructor. mp.Pool(processes=None) will use mp.cpu_count() for the number of processes.
Then you can use, for example, .imap_unordered(...) on your Pool instance to spread your degreelist across processes. Or maybe some other Pool method would work better for you - experiment.
If you can't bash the problem into Pool's view of the world, you could instead create an mp.Queue to create a work queue, .put()'ing nodes (or slices of nodes, to reduce overhead) to work on in the main program, and write the workers to .get() work items off that queue. Ask if you need examples. Note that you need to put sentinel values (one per process) on the queue, after all the "real" work items, so that worker processes can test for the sentinel to know when they're done.
FYI, I like queues because they're more explicit. Many others like Pools better because they're more magical ;-)
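
For readers who want to see the queue approach spelled out, here is a minimal sketch of a work queue with one sentinel per worker. It assumes a toy task (squaring numbers) in place of the graph traversal; worker, run_queue and SENTINEL are names invented for this illustration.

```python
import multiprocessing as mp

SENTINEL = None  # marks "no more work" for one worker


def worker(tasks, results):
    """Pull items off the task queue until a sentinel arrives."""
    while True:
        item = tasks.get()
        if item is SENTINEL:
            break
        results.put((item, item * item))


def run_queue(items, n_workers=2):
    tasks, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(tasks, results))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    for item in items:
        tasks.put(item)
    # One sentinel per process, after all the real work items.
    for _ in procs:
        tasks.put(SENTINEL)
    # Drain the results before joining, so no worker is left blocked
    # while flushing data it has put on the result queue.
    out = dict(results.get() for _ in items)
    for p in procs:
        p.join()
    return out


if __name__ == "__main__":
    print(run_queue(list(range(10))))
```

Putting slices of nodes on the queue instead of single nodes, as suggested above, would cut the number of queue operations.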
Pool Example
Here's an executable prototype for you. This shows one way to use imap_unordered with Pool and chunksize that doesn't require changing any function signatures. Of course you'll have to plug in your real code ;-) Note that the init_worker approach allows passing "most of" the arguments only once per process, not once for every item in your degreelist. Cutting the amount of inter-process communication can be crucial for speed.
import multiprocessing as mp
def init_worker(mps, fps, cut):
    # Runs once in each worker process and stores the shared arguments as globals
    global memorizedPaths, filepaths, cutoff
    global DG
    print("process initializing", mp.current_process())
    memorizedPaths, filepaths, cutoff = mps, fps, cut
    DG = 1##nx.read_gml("KeggComplete.gml", relabel = True)
def work(item):
    _all_simple_paths_graph(DG, cutoff, item, memorizedPaths, filepaths)
def _all_simple_paths_graph(DG, cutoff, item, memorizedPaths, filepaths):
    pass # print("doing " + str(item))
if __name__ == "__main__":
    m = mp.Manager()
    memorizedPaths = m.dict()
    filepaths = m.dict()
    cutoff = 1 ##
    # use all available CPUs
    p = mp.Pool(initializer=init_worker, initargs=(memorizedPaths,
                                                   filepaths,
                                                   cutoff))
    degreelist = range(100000) ##
    for _ in p.imap_unordered(work, degreelist, chunksize=500):
        pass
    p.close()
    p.join()
I strongly advise running this exactly as-is, so you can see that it's blazing fast. Then add things to it a bit at a time, to see how that affects the time. For example, just adding
memorizedPaths[item] = item
to _all_simple_paths_graph() slows it down enormously. Why? Because the dict gets bigger and bigger with each addition, and this process-safe dict has to be synchronized (under the covers) among all the processes. The unit of synchronization is "the entire dict" - there's no internal structure the mp machinery can exploit to do incremental updates to the shared dict.
If you can't afford this expense, then you can't use a Manager.dict() for this. Opportunities for cleverness abound ;-)
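
One possible direction, sketched here under heavy assumptions, is to have each worker return its result and let the main process collect everything in an ordinary dict, so nothing has to be synchronized while the workers run. The work function below is a placeholder computation and collect is a hypothetical helper. The trade-off is that workers can no longer see paths memorized by other workers during the run.

```python
import multiprocessing as mp


def work(item):
    # Placeholder computation standing in for the real path search.
    return item, [item, item + 1]


def collect(pairs):
    """Build a plain dict in the parent from (key, value) pairs."""
    memorized = {}
    for key, value in pairs:
        memorized[key] = value
    return memorized


if __name__ == "__main__":
    with mp.Pool() as pool:
        # Results travel back once per chunk of items; only the parent
        # process ever touches the dict.
        memorizedPaths = collect(
            pool.imap_unordered(work, range(1000), chunksize=100))
    print(len(memorizedPaths))
```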