I am using multiprocessing for a project which involves going to a URL. I have noticed that whenever I use pool.imap_unordered(), whatever my iterable is (let's say it is a list containing the numbers 1 and 2, i.e. two items), it runs the program once with one thread, and then, because there are two items in the list, runs it a second time. I can't seem to figure this out; I thought I understood what everything should be doing. (No, it doesn't run any faster no matter how many threads I have.) (args.urls is originally a file; I then convert all the content of the file to a list.) Everything worked fine until I added multiprocessing, so I know it can't be an error in my non-multiprocessing-related code.
from multiprocessing import Pool
import multiprocessing
import requests
arrange = [ lines.replace("\n", "") for lines in #file ]
def check():
    for lines in arrange:
        requests.get(lines)

def main():
    pool = ThreadPool(4)
    results = pool.imap_unordered(check, arrange)
So I'm not entirely sure what you are trying to do, but maybe this is what you need:
from multiprocessing.pool import ThreadPool
import requests

arrange = [ line.replace("\n", "") for line in #file ]

def check(line):
    requests.get(line)  # no loop needed: with the pool, each call receives a single line

def main():
    pool = ThreadPool(4)
    results = pool.imap_unordered(check, arrange)  # loops through arrange and passes check a single line per call
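One caveat worth adding: imap_unordered returns a lazy iterator, so nothing actually runs until you consume the results, and main() still has to be called somewhere. A minimal sketch, with a hypothetical urls list standing in for the file contents:

from multiprocessing.pool import ThreadPool
import requests

urls = ["https://example.com", "https://example.org"]  # hypothetical stand-in for the file contents

def check(url):
    return url, requests.get(url).status_code

def main():
    with ThreadPool(4) as pool:
        # iterating the result is what forces the requests to actually run
        for url, status in pool.imap_unordered(check, urls):
            print(url, status)

if __name__ == "__main__":
    main()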
Related
This question has been asked and solved a few times recently but I have quite a specific example...
I have a multiprocessing function that was working absolutely fine in complete isolation yesterday (in an interactive notebook). However, I decided to parameterise it so I can call it as part of a larger pipeline (and for abstraction / a cleaner notebook), and now it only uses a single thread instead of 6.
import pandas as pd
import multiprocessing as mp
from multiprocessing import get_context
mp.set_start_method('forkserver')
def multiprocess_function(func, iterator, input_data):
    result_list = []

    def append_result(result):
        result_list.append(result)

    with get_context('fork').Pool(processes=6) as pool:
        for i in iterator:
            pool.apply_async(func, args=(i, input_data), callback=append_result)
        pool.close()
        pool.join()

    return result_list

multiprocess_function(count_live, run_weeks, base_df)
My previous version of the code executed differently: instead of a return / call, I used the following at the bottom of the function (which doesn't work at all now that I've parameterised, even with the args assigned):
if __name__ == '__main__':
    multiprocess_function()
The function executes fine, just only operates across one thread as per the output in top.
Apologies if this is something incredibly simple - I'm not a programmer, I'm an analyst :)
edit: everything works absolutely fine if I include the if __name__ == '__main__': etc. at the bottom of the function and execute the cell; however, when I do this I have to remove the parameters - maybe it's just something to do with scoping. If I execute by calling the function, whether it is parameterised or not, it only operates on a single thread.
You've got two problems:
You're not using an import guard.
You're not setting the default start method inside the import guard.
Between the two of them, you end up telling Python to spawn the forkserver inside the forkserver, which can only cause you grief. Change the structure of your code to:
import pandas as pd
import multiprocessing as mp
from multiprocessing import get_context
def multiprocess_function(func, iterator, input_data):
    result_list = []
    with get_context('fork').Pool(processes=6) as pool:
        for i in iterator:
            pool.apply_async(func, args=(i, input_data), callback=result_list.append)
        pool.close()
        pool.join()
    return result_list

if __name__ == '__main__':
    mp.set_start_method('forkserver')
    multiprocess_function(count_live, run_weeks, base_df)
Since you didn't show where you got count_live, run_weeks and base_df from, I'll just say that for the code as written, they should be defined in the guarded section (since nothing relies on them as a global).
There are other improvements to be made (apply_async is being used in a way that makes me think you really just wanted to listify the result of pool.imap_unordered, without the explicit loop), but that fixes the big issues that will wreck use of the spawn or forkserver start methods.
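For illustration, a minimal sketch of that simplification, assuming func's second parameter is actually named input_data (a placeholder name here, since count_live's signature isn't shown):

from functools import partial
from multiprocessing import get_context

def multiprocess_function(func, iterator, input_data):
    # bind the shared argument once; the pool then only has to pass each item
    worker = partial(func, input_data=input_data)
    with get_context('fork').Pool(processes=6) as pool:
        return list(pool.imap_unordered(worker, iterator))

Like the apply_async-with-callback version, this does not preserve input order; use pool.imap instead if order matters.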
using "get_context('spawn') " instead of "get_context('fork')" maybe will solve your problem
I have a python generator that returns lots of items, for example:
import itertools
def generate_random_strings():
chars = "ABCDEFGH"
for item in itertools.product(chars, repeat=10):
yield "".join(item)
I then iterate over this and perform various tasks, the issue is that I'm only using one thread/process for this:
my_strings = generate_random_strings()
for string in my_strings:
    # do something with string...
    print(string)
This works great and I'm getting all my strings, but it's slow. I would like to harness the power of Python multiprocessing to "divide and conquer" this for loop. However, of course, I want each string to be processed only once. While I've found lots of documentation on multiprocessing, I'm trying to find the simplest solution with the least amount of code.
I'm assuming each thread should take a big chunk of items every time and process them before coming back and getting another big chunk etc...
Many thanks,
Simplest solution with the least code? The multiprocessing context manager.
I assume that you can put "do something with string" into a function called "do_something"
from multiprocessing import Pool as ProcessPool
number_of_processes = 4
with ProcessPool(number_of_processes) as pool:
    pool.map(do_something, my_strings)
If you want to get the results of "do_something" back again, easy!
with ProcessPool(number_of_processes) as pool:
    results = pool.map(do_something, my_strings)
You'll get them in a list.
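On the "big chunk of items" point from the question: map and imap accept a chunksize argument that controls how many items each worker pulls at a time. A rough sketch, with a placeholder do_something and the generator's repeat reduced so it finishes quickly:

import itertools
from multiprocessing import Pool as ProcessPool

def generate_random_strings():
    chars = "ABCDEFGH"
    for item in itertools.product(chars, repeat=4):  # reduced from 10 for the sketch
        yield "".join(item)

def do_something(s):
    return s.lower()  # placeholder for the real per-string work

if __name__ == "__main__":
    with ProcessPool(4) as pool:
        # each worker receives batches of 500 strings instead of one at a time;
        # imap also keeps memory bounded compared to map on a huge generator
        for result in pool.imap(do_something, generate_random_strings(), chunksize=500):
            pass  # consume the results as they arrive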
multiprocessing.dummy is a wrapper around the threading module that exposes the same API as multiprocessing, so you keep the same syntax but get threads instead of processes. If you want threads instead of processes, just do this:
from multiprocessing.dummy import Pool as ThreadPool
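For example, a small sketch of the same map call backed by threads (the strings here are just placeholders for the generated ones):

from multiprocessing.dummy import Pool as ThreadPool

def do_something(s):
    return s.lower()  # placeholder for the real per-string work

with ThreadPool(4) as pool:
    results = pool.map(do_something, ["ABC", "DEF", "GHI"])
print(results)  # ['abc', 'def', 'ghi']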
You may use multiprocessing.
import multiprocessing
def string_fun(string):
    # do something with string...
    print(string)
my_strings = generate_random_strings()
num_of_threads = 7
pool = multiprocessing.Pool(num_of_threads)
pool.map(string_fun, my_strings)
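One caveat with the snippet above: on Windows, or with any spawn-based start method, the Pool should be created under an import guard, because each worker re-imports the script and would otherwise re-run the module-level pool setup. A minimal variant (with the question's generator reduced so it finishes quickly):

import itertools
import multiprocessing

def generate_random_strings():
    chars = "ABCDEFGH"
    for item in itertools.product(chars, repeat=4):  # reduced from 10 for the sketch
        yield "".join(item)

def string_fun(string):
    print(string)  # do something with string...

if __name__ == '__main__':
    with multiprocessing.Pool(7) as pool:
        pool.map(string_fun, generate_random_strings())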
Assuming you're using the latest version of Python, you may want to read something about the asyncio module. Multithreading is not easy to implement due to the GIL: "In CPython, the global interpreter lock, or GIL, is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once. This lock is necessary mainly because CPython's memory management is not thread-safe."
So you can switch to multiprocessing or, as reported above, take a look at the asyncio module.
asyncio — Asynchronous I/O > https://docs.python.org/3/library/asyncio.html
I'll integrate this answer with some code as soon as possible.
Hope it helps,
Hele
As #Hele mentioned, asyncio is the best option here; here is an example.
Code
#!/usr/bin/python3
# -*- coding: utf-8 -*-
# python 3.7.2
from asyncio import ensure_future, gather, run
import random
alphabet = 'ABCDEFGH'
size = 1000
async def generate():
    tasks = list()
    result = None
    for el in range(1, size):
        task = ensure_future(generate_one())
        tasks.append(task)
    result = await gather(*tasks)
    return list(set(result))

async def generate_one():
    return ''.join(random.choice(alphabet) for i in range(8))

if __name__ == '__main__':
    my_strings = run(generate())
    print(my_strings)
Output
['CHABCGDD', 'ACBGAFEB', ...
Of course, you need to improve generate_one; this variant is very slow.
You can see source code here.
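One possible way to make generate_one cheaper while keeping the asyncio structure (a sketch, with batch sizes chosen arbitrarily): schedule a handful of batch tasks instead of one task per string. Note that string generation is CPU-bound, so asyncio mostly helps here by reducing scheduling overhead rather than adding parallelism.

import random
from asyncio import gather, run

alphabet = 'ABCDEFGH'

async def generate_batch(count, length=8):
    # build a whole batch per task, and use random.choices to build each
    # string in a single call instead of character by character
    return [''.join(random.choices(alphabet, k=length)) for _ in range(count)]

async def generate(total=1000, batches=10):
    results = await gather(*(generate_batch(total // batches) for _ in range(batches)))
    # flatten and de-duplicate, as in the answer above
    return list({s for batch in results for s in batch})

if __name__ == '__main__':
    my_strings = run(generate())
    print(len(my_strings))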
I'm using multiprocessing to do a large number of calculations on a set of data to decrease the calculation time. It's working fantastically, except for one small caveat: when I have my listener process writing my outputs, they come out in the wrong order, which is decidedly bad. I need it all to come out in the same order it goes in, and I'm not sure how to achieve this. Sample code is below.
import numpy, os, multiprocessing
from multiprocessing.sharedctypes import Value, Array, RawArray, RawValue
from multiprocessing import Process, Lock
def domorestuff(value):
    value += value  # sample, some other calculation
    q.put(value)
    return

def dostuff(somevalue):
    somevalue += 1  # do some calculation instead of just +=1 here
    domorestuff(somevalue)
    return

def listener(q):
    f = open(os.path.join(outdir, fileout.value), 'w')
    while 1:
        #print("Listener...", flush=True)
        m = q.get()
        if(m == 'kill'):
            break
        #print("Listen write...", flush=True)
        f.write(str(m) + '\n')
        f.flush()
    f.close()

def main():
    manager = multiprocessing.Manager()
    q = manager.Queue()
    pool = multiprocessing.Pool(9)
    watcher = pool.apply_async(listener, (q,))
    pool.map(dostuff, range(8))
    q.put('kill')
    pool.close()
I'd expect it to give me a linear set of values in the file, i.e.:
2, 4, 6, 8, 10, 12, 14, 16
But instead they come out in a random order every time. I'm at a loss how to sync things up; when I don't use a listener and am not writing to a file, it seems to join the processes in order, batched by the number of threads. But it is hard to tell for sure, since I can't safely write the output from many threads to a single file.
To make it a bit clearer: the processing happens on an input file, each thread reads the part it needs and then writes an output based on that processing to the listener. But rather than getting the chunks in order, as mentioned above, they come out in randomly ordered chunks.
You are running your processes asynchronously; you cannot expect these independent processes to finish their tasks in any particular order.
#M.Rau is not actually right: you can run the jobs in the pool and join them back together while preserving the order, and luckily the multiprocessing module has this feature built in, using either pool.apply_async or pool.imap.
I cleaned your code up a little bit (note that the queue is completely gone), and this is what I came up with:
import multiprocessing

def domorestuff(value):
    return value + value  # sample, some other calculation

def dostuff(somevalue):
    somevalue += 1  # do some calculation instead of just +=1 here
    return domorestuff(somevalue)

def main():
    pool = multiprocessing.Pool(9)
    out = list(pool.imap(dostuff, range(8)))
    pool.close()
    print(out)

if __name__ == '__main__':
    main()
For more information, take a look at the examples in the official docs; several different techniques are explained right there. By the way, your Python code from the question does not even run as posted, and the listener function is unnecessary. Hopefully this helps!
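If the goal is still to write the results to a file in input order, a minimal sketch (the output path is just a placeholder) is to let the main process do the writing while pool.imap yields results in order:

import multiprocessing

def dostuff(somevalue):
    somevalue += 1              # stand-in for the real calculation
    return somevalue + somevalue

def main():
    with multiprocessing.Pool(9) as pool:
        with open("output.txt", "w") as f:  # placeholder path
            # imap yields results in the same order as the input,
            # and only the main process ever touches the file
            for result in pool.imap(dostuff, range(8)):
                f.write(str(result) + "\n")

if __name__ == "__main__":
    main()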
I have code that works with Thread in Python, but I want to switch to Process since, if I have understood correctly, that will give me a speed-up.
Here there is the code with Thread:
threads.append(Thread(target=getId, args=(my_queue, read)))
threads.append(Thread(target=getLatitude, args=(my_queue, read)))
The code works by putting the return values in the Queue, and after a join on the threads list I can retrieve the results.
Changing the code and the import statement my code now is like that:
threads.append(Process(target=getId, args=(my_queue, read)))
threads.append(Process(target=getLatitude, args=(my_queue, read)))
However, it does not execute anything and the Queue is empty; with Thread the Queue is not empty, so I think it is related to Process.
I have read answers saying that the Process class does not work on Windows. Is that true, or is there a way to make it work (adding freeze_support() does not help)?
If not, is multithreading on Windows actually executed in parallel on different cores?
ref:
Python multiprocessing example not working
Python code with multiprocessing does not work on Windows
Multiprocessing process does not join when putting complex dictionary in return queue
(where it is described that fork does not exist on Windows)
EDIT:
To add some details: the code with Process is actually working on CentOS.
EDIT2:
Added a simplified version of my code with processes; the code was tested on CentOS.
import pandas as pd
from multiprocessing import Process, freeze_support
from multiprocessing import Queue
#%% Global variables
datasets = []
latitude = []
def fun(key, job):
    global latitude
    if(key == 'LAT'):
        latitude.append(job)

def getLatitude(out_queue, skip = None):
    latDict = {'LAT' : latitude}
    out_queue.put(latDict)

n = pd.read_csv("my.csv", sep =',', header = None).shape[0]
print("Number of baboon:" + str(n))

read = []
for i in range(0,n):
    threads = []
    my_queue = Queue()
    threads.append(Process(target=getLatitude, args=(my_queue, read)))

    for t in threads:
        freeze_support()  # try both with and without this line
        t.start()

    for t in threads:
        t.join()

    while not my_queue.empty():
        try:
            job = my_queue.get()
            key = list(job.keys())
            fun(key[0], job[key[0]])
        except:
            print("END")

    read.append(i)
Per the documentation, you need the following after the function definitions. When Python creates the subprocesses, they import your script, so any code at the global level is run multiple times. Guard the code you only want to run in the main process:
if __name__ == '__main__':
    n = pd.read_csv("my.csv", sep =',', header = None).shape[0]
    # etc.
Indent the rest of the code under this if.
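A rough sketch of the resulting structure, keeping the question's names (the CSV path is the question's own placeholder, and the result handling is reduced to a comment):

import pandas as pd
from multiprocessing import Process, Queue, freeze_support

latitude = []

def getLatitude(out_queue, skip=None):
    out_queue.put({'LAT': latitude})

if __name__ == '__main__':
    freeze_support()  # only needed when freezing the script into an executable
    n = pd.read_csv("my.csv", sep=',', header=None).shape[0]
    read = []
    for i in range(n):
        my_queue = Queue()
        p = Process(target=getLatitude, args=(my_queue, read))
        p.start()
        job = my_queue.get()  # exactly one item per process, so a blocking get is enough
        p.join()
        # handle the result dict here, e.g. append job['LAT'] to a list
        read.append(i)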
I'm trying to make an expensive part of my pandas calculations parallel to speed up things.
I've already managed to make Multiprocessing.Pool work with a simple example:
import multiprocessing as mpr
import numpy as np
def Test(l):
    for i in range(len(l)):
        l[i] = i**2
    return l

t = list(np.arange(100))
L = [t,t,t,t]

if __name__ == "__main__":
    pool = mpr.Pool(processes=4)
    E = pool.map(Test,L)
    pool.close()
    pool.join()
No problems here. My own algorithm is a bit more complicated; I can't post it here in its full glory and terribleness, so I'll use some pseudo-code to outline what I'm doing:
import pandas as pd
import time
import datetime as dt
import multiprocessing as mpr
import MPFunctions as mpf        # self-written worker functions that get called for the multiprocessing
import ClassGetDataFrames as gd  # self-written class that reads in all the data and puts it into dataframes
=== Settings
=== Use ClassGetDataFrames to get data
=== Lots of single-thread calculations and manipulations on the dataframe
=== Cut dataframe into 4 evenly big chunks, make list of them called DDC
if __name__ == "__main__":
    pool = mpr.Pool(processes=4)
    LLT = pool.map(mpf.processChunks,DDC)
    pool.close()
    pool.join()
=== Join processed Chunks LLT back into one dataframe
=== More calculations and manipulations
=== Data Output
When I'm running this script the following happens:
It reads in the data.
It does all calculations and manipulations until the Pool statement.
Suddenly it reads in the data again, fourfold.
Then it goes into the main script fourfold at the same time.
The whole thing cascades recursively and goes haywire.
I have read before that this can happen if you're not careful, but I do not know why it happens here. My multiprocessing code is protected by the required name-main statement (I'm on Win7 64), it is only 4 lines long, it has close and join statements, and it calls one defined worker function which then calls a second worker function in a loop - that's it. As far as I know, it should just create the pool with four processes, call the worker from the imported script in those four processes, close the pool and wait until everything is done, then just continue with the script. On a side note, I first had the worker functions in the same script and the behaviour was the same. Instead of just doing what's in the pool, it seems to restart the whole script fourfold.
Can anyone enlighten me what might cause this behaviour? I seem to be missing some crucial understanding about Python's multiprocessing behaviour.
Also, I don't know if it's important, but I'm on a virtual machine that sits on my company's mainframe.
Do I have to use individual processes instead of a pool?
I managed to make it work by enclosing the entire script inside the if __name__ == "__main__": statement, not just the multiprocessing part.
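For reference, a rough sketch of that structure, with the pipeline steps and the worker reduced to placeholders (process_chunk here just stands in for mpf.processChunks from the question):

import multiprocessing as mpr

def process_chunk(chunk):
    # placeholder for mpf.processChunks from the question
    return [x * 2 for x in chunk]

def run():
    # settings, data loading and single-thread preprocessing would go here,
    # so the spawned workers never re-execute them on import
    DDC = [[1, 2], [3, 4], [5, 6], [7, 8]]  # placeholder for the 4 dataframe chunks

    pool = mpr.Pool(processes=4)
    LLT = pool.map(process_chunk, DDC)
    pool.close()
    pool.join()

    # joining the chunks back together, more calculations and output would go here
    print(LLT)

if __name__ == "__main__":
    run()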