I need win32com.client.dispatch as multiprocess - python

I want to dispatch a COM object from Python using win32com.client.DispatchEx('***Something***').
I want this done in multiple processes,
but currently, when I launch it twice, it always runs as a single process.
When I run DispatchEx, I need two processes to be created in Task Manager, with two process IDs.

Maybe this will be useful to you. I test many pages in a similar way, opening each one in a separate Chrome webdriver, and it works fine; the pages are independent of each other.
import concurrent.futures
name = ['James', 'Arnold', 'Jash', 'Johny', 'Harry', 'Michael']
def configer(name):
    i = 0
    b = 0
    while i == 0:
        print(b, name, '\n')
        b += 1
        if b == 10:
            i += 1
size = 6
with concurrent.futures.ThreadPoolExecutor(size) as thp:
    thp.map(configer, name)
The variable size sets the maximum number of worker threads. If you set it to 2, at most two calls to configer run at the same time; map still calls configer once for every item in name.
You can also add an if statement to the configer function that executes only if name == 'some_value'; that way two threads started at the same time can do completely different things.
This method can put a high load on the CPU when many threads run at once, so keep an eye on CPU and memory usage.
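Since the question asks for two entries in Task Manager, note that a thread pool keeps everything inside one process. Below is a minimal sketch of a process-based alternative, assuming each child process would create its own COM object. The names describe_process and dispatch_worker are made up for illustration, and the win32com call is left as a comment because it needs Windows, pywin32 and a real ProgID. Whether the COM server itself starts a second process depends on the server.

```python
import multiprocessing as mp
import os


def describe_process(label):
    # Return the label together with the ID of the process running this call.
    return (label, os.getpid())


def dispatch_worker(label, results):
    # Assumption: on Windows with pywin32 installed, the COM object would be
    # created here, inside the child process, for example:
    #   import win32com.client
    #   obj = win32com.client.DispatchEx('***Something***')
    results.put(describe_process(label))


if __name__ == "__main__":
    results = mp.Queue()
    procs = [
        mp.Process(target=dispatch_worker, args=(f"worker-{n}", results))
        for n in range(2)
    ]
    for proc in procs:
        proc.start()
    # Read the reports before joining so the queue is drained first.
    reports = [results.get() for _ in procs]
    for proc in procs:
        proc.join()
    for label, pid in reports:
        print(label, "ran in process", pid)
```

Each Process object is a separate Python interpreter process, so the two reported process IDs differ from each other and from the parent's.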

Related

multiprocessing stops program execution - python

I'm a noob with multiprocessing, and I'm trying to speed up an old algorithm of mine. It works perfectly fine without multiprocessing, but the moment I try to implement it, the program stops working: it hangs until I abort the script.
Another issue is that it doesn't populate the dataframe: again, normally it works, but with multiprocessing it returns only NaN.
func works well.
stockUniverse = list(map(lambda s: s.strip(), Stocks)) #Stocks = list
def func(i):
    df.at[i, 'A'] = 1
    df.at[i, 'B'] = 2
    df.at[i, 'C'] = 3
    print(i, 'downloaded')
    return True
if __name__ == "__main__":
    print('Start')
    pool = mp.Pool(mp.cpu_count())
    pool.imap(func, stockUniverse)
    print(df)
the result is:
Index 19 NaN NaN NaN
index 20 NaN NaN NaN
And then it stops there until I hit Ctrl+C.
Thanks
The map function blocks until all the submitted tasks have completed and returns a list of the return values from the worker function. But the imap function returns immediately with an iterator that must be iterated to obtain the return values one by one as each becomes available. Your original code did not iterate that iterator but instead immediately printed out what it expected to be the updated df. But you would not have given the tasks enough time to start and complete for df to have been modified. In theory, if you had inserted a call to time.sleep with a sufficiently long delay before the print statement, the tasks would have started and completed before you printed out df. But clearly, iterating the iterator is the most efficient way of being sure all tasks have completed and the only way of getting return values back.
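The difference between map and imap can be sketched as follows. This is only an illustration, not the code from the question: double is a made-up stand-in for the worker function.

```python
import multiprocessing as mp


def double(x):
    # Stand-in worker; any picklable top-level function would do.
    return 2 * x


if __name__ == "__main__":
    with mp.Pool(2) as pool:
        # map blocks and returns the complete list of results.
        print(pool.map(double, [1, 2, 3]))
        # imap returns at once; results arrive only as the iterator is consumed.
        lazy = pool.imap(double, [1, 2, 3])
        for value in lazy:
            print(value)
```

Leaving out the for loop over lazy reproduces the situation in the question: the tasks are submitted, but nothing waits for them.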
But, as I mentioned in my comment, you have a much bigger problem. The tasks you submitted are executed by the worker function func, which is called by processes in the process pool that you created, each executing in its own address space. You did not tag your question with the platform on which you are running (whenever you tag a question with multiprocessing, you are supposed to also tag it with the platform), but I infer that you are running on a platform that uses the spawn method to create new processes, such as Windows, and that this is why you have the if __name__ == "__main__": block guarding the code that creates new processes (i.e. the processing pool). When spawn is used to create new processes, a new, empty address space is created, a new Python interpreter is launched and the source is re-executed from the top (without the if __name__ == "__main__": guard around the code that creates new processes, you would get into an infinite, recursive loop creating new processes). But this means that any definition of df at global scope outside the if __name__ == "__main__": block (a definition you must have omitted from your post if you are running under Windows) creates a new, separate instance for each process in the pool as that process is created.
If you are instead running under Linux, where fork is used to create new processes, the story is a bit different. The new processes will inherit the original address space from the main process and all declared variables, but copy on write is used. That means that once a subprocess attempts to modify any variable in this inherited storage, a copy of the page is made and the process will now be working on its own copy. So again, nothing can be shared for updating purposes.
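A small sketch of this effect, which behaves the same under spawn and fork. The names counter, bump and child are invented for the illustration; counter plays the role of the global df.

```python
import multiprocessing as mp

counter = {"value": 0}  # module-level state, playing the role of df


def bump():
    # Modifies this process's own copy of counter and reports the new value.
    counter["value"] += 1
    return counter["value"]


def child(results):
    # Runs in the child process and sends back what the child sees.
    results.put(bump())


if __name__ == "__main__":
    results = mp.Queue()
    proc = mp.Process(target=child, args=(results,))
    proc.start()
    seen_by_child = results.get()
    proc.join()
    print("child saw:", seen_by_child)       # the child's copy was updated
    print("parent sees:", counter["value"])  # the parent's copy is unchanged
```

The only value that crosses the process boundary is the one explicitly sent through the queue.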
You should therefore modify your program to have your worker function return values back to the main process, which will do the necessary updating:
import multiprocessing as mp
import pandas as pd
def func(stock):
    return (stock, (('A', 1), ('B', 1), ('C', 1)))
if __name__ == "__main__":
    stockUniverse = ['abc', 'def', 'ghi', 'klm']
    d = {col: pd.Series(index=stockUniverse, dtype='int32') for col in ['A', 'B', 'C']}
    df = pd.DataFrame(d)
    pool_size = min(mp.cpu_count(), len(stockUniverse))
    pool = mp.Pool(pool_size)
    for result in pool.imap_unordered(func, stockUniverse):
        stock, col_values = result # unpack
        for col_value in col_values:
            col, value = col_value # unpack
            df.at[stock, col] = value
    print(df)
Prints:
     A  B  C
abc  1  1  1
def  1  1  1
ghi  1  1  1
klm  1  1  1
Note that I have used imap_unordered instead of imap. The former method is allowed to return the results in arbitrary order (i.e. as they become available) and is generally more efficient and since the return value contains all the information required for setting the correct row of df, we no longer require any specific ordering.
But:
If your worker function is doing largely nothing but downloading from a website and very little CPU-intensive processing, then you could (should) be using a thread pool by making the simple substitution of:
from multiprocessing.pool import ThreadPool
...
MAX_THREADS_TO_USE = 100 # or maybe even larger!!!
pool_size = min(MAX_THREADS_TO_USE, len(stockUniverse))
pool = ThreadPool(pool_size)
And since all threads share the same address space, you could use your original worker function, func as is!
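A dependency-free sketch of that idea follows. It assumes a plain dict named table in place of df so that it runs without pandas; fill_table is a hypothetical helper, and func keeps the shape of the original worker.

```python
from multiprocessing.pool import ThreadPool

stock_universe = ['abc', 'def', 'ghi', 'klm']
# Plain dict standing in for df so the sketch has no pandas dependency.
table = {stock: {} for stock in stock_universe}


def func(stock):
    # Threads share the process's memory, so this update is visible to the caller.
    table[stock].update({'A': 1, 'B': 2, 'C': 3})
    return True


def fill_table(max_threads=100):
    pool_size = min(max_threads, len(stock_universe))
    with ThreadPool(pool_size) as pool:
        # map blocks until every task has finished.
        return pool.map(func, stock_universe)


if __name__ == "__main__":
    fill_table()
    print(table)
```

Here each thread writes only to its own key, which keeps the updates from interfering with each other.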

Multiprocessing pool for function with no arguments/iterable?

I'm running Python 2.7 on the GCE platform to do calculations. The GCE instances boot, install various packages, copy 80 Gb of data from a storage bucket and run a "workermaster.py" script with nohangup. The workermaster runs in an infinite loop which checks a task-queue bucket for tasks. When the task bucket isn't empty, it picks a random file (task) and passes the work to a calculation module. If there is nothing to do, the workermaster sleeps for a number of seconds and checks the task list again. The workermaster runs continuously until the instance is terminated (or something breaks!).
Currently this works quite well, but my problem is that my code only runs instances with a single CPU. If I want to scale up calculations I have to create many identical single-CPU instances and this means there is a large cost overhead for creating many 80 Gb disks and transferring the data to them each time, even though the calculation is only "reading" one small portion of the data for any particular calculation. I want to make everything more efficient and cost effective by making my workermaster capable of using multiple CPUs, but after reading many tutorials and other questions on SO I'm completely confused.
I thought I could just turn the important part of my workermaster code into a function, and then create a pool of processes that "call" it using the multiprocessing module. Once the workermaster loop is running on each CPU, the processes do not need to interact with each other or depend on each other in any way, they just happen to be running on the same instance. The workermaster prints out information about where it is in the calculation and I'm also confused about how it will be possible to tell the "print" statements from each process apart, but I guess that's a few steps from where I am now! My problems/confusion are that:
1) My workermaster "def" doesn't return any value because it just starts an infinite loop, whereas every web example seems to have something in the format myresult = pool.map(.....); and
2) My workermaster "def" doesn't need any arguments/inputs - it just runs, whereas the examples of multiprocessing that I have seen on SO and on the Python Docs seem to have iterables.
In case it is important, the simplified version of the workermaster code is:
# module imports are here
# filepath definitions go here
def workermaster():
    while True:
        tasklist = cloudstoragefunctions.getbucketfiles('<my-task-queue-bucket')
        if tasklist:
            tasknumber = random.randint(2, len(tasklist))
            assignedtask = tasklist[tasknumber]
            print 'Assigned task is now: ' + assignedtask
            subprocess.call('gsutil -q cp gs://<my-task-queue-bucket>/' + assignedtask + ' "' + taskfilepath + assignedtask + '"', shell=True)
            tasktype = assignedtask.split('#')[0]
            if tasktype == 'Calculation':
                currentcalcid = assignedtask.split('#')[1]
                currentfilenumber = assignedtask.split('#')[2].replace('part', '')
                currentstartfile = assignedtask.split('#
                currentendfile = assignedtask.split('#')[4].replace('.csv', '')
                calcmodule.docalc(currentcalcid, currentfilenumber, currentstartfile, currentendfile)
            elif tasktype == 'Analysis':
                #set up and run analysis module, etc.
            print ' Operation completed!'
            os.remove(taskfilepath + assignedtask)
        else:
            print 'There are no tasks to be processed. Going to sleep...'
            time.sleep(30)
I'm trying to "call" the function multiple times using the multiprocessing module. I think I need to use the "pool" method, so I've tried this:
import multiprocessing
if __name__ == "__main__":
    p = multiprocessing.Pool()
    pool_output = p.map(workermaster, [])
My understanding from the docs is that the __name__ line is there only as a workaround for doing multiprocessing in Windows (which I am doing for development, but GCE is on Linux). The p = multiprocessing.Pool() line is creating a pool of workers equal to the number of system CPUs, as no argument is specified. If the number of CPUs was 1, then I would expect the code to behave as it did before I attempted to use multiprocessing. The last line is the one that I don't understand. I thought that it was telling each of the processors in the pool that the "target" (thing to run) is workermaster. From the docs there appears to be a compulsory argument which is an iterable, but I don't really understand what this is in my case, as workermaster doesn't take any arguments. I've tried passing it an empty list, empty string, empty brackets (tuple?) and it doesn't do anything.
Please would it be possible for someone help me out? There are lots of discussions about using multiprocessing and this thread Mulitprocess Pools with different functions and this one python code with mulitprocessing only spawns one process each time seem to be close to what I am doing but still have iterables as arguments. If there is anything critical that I have left out please advise and I will modify my post - thank you to anyone who can help!
Pool() is useful if you want to run the same function with different arguments.
If you want to run the function only once, use a normal Process().
If you want to run the same function 2 times, you can manually create 2 Process() objects.
If you want to use Pool() to run the function 2 times, pass a list with 2 arguments (even if you don't need the arguments), because the length of the list is what tells Pool() to run it 2 times.
But if you run the function 2 times on the same folder, both copies may run the same task; if you run it 5 times, the same task may be run 5 times. I don't know whether that matters in your case.
As for Ctrl+C, I found "Catch Ctrl+C / SIGINT and exit multiprocesses gracefully in python" on Stack Overflow, but I don't know if it resolves your problem.
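Both options can be sketched like this. The body of workermaster here is a bounded stand-in so that the demo terminates, and run_workermaster is a made-up wrapper that swallows the dummy argument Pool.map passes in.

```python
import multiprocessing as mp
import os


def workermaster():
    # Bounded stand-in for the real infinite loop so that the demo ends.
    for _ in range(3):
        pass  # the real code would poll the task bucket here
    return os.getpid()


def run_workermaster(_unused):
    # Pool.map always passes one item per call; accept it and ignore it.
    return workermaster()


if __name__ == "__main__":
    # Option 1: one Process per copy; no arguments are needed.
    procs = [mp.Process(target=workermaster) for _ in range(2)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()

    # Option 2: a Pool, with one dummy item per call that should be made.
    with mp.Pool(2) as pool:
        pids = pool.map(run_workermaster, range(2))
    print("calls ran in processes:", pids)
```

With the real infinite loop, join() and map() never return, which matches a worker that is meant to run until the instance is terminated.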

Most efficient way to multiprocess separate functions over same object

In a python script, I have a large dataset that I would like to apply multiple functions to. The functions are responsible for creating certain outputs that get saved to the hard drive.
A few things of note:
the functions are independent
none of the functions return anything
the functions will take variable amounts of time
some of the functions may fail, and that is fine
Can I multiprocess this in any way that each function and the dataset are sent separately to a core and run there? This way I do not need the first function to finish before the second one can kick off? There is no need for them to be sequentially dependent.
Thanks!
Since your functions are independent and only read the data, they are also thread safe, as long as nothing modifies the data while a function is executing.
Use a thread pool. You would have to create one task per function you want to run.
Note: for the work to run on more than one core, you must use Python multiprocessing; otherwise all the threads will run on a single core. This happens because Python has a Global Interpreter Lock (GIL). For more information, see the question "Python threads all executing on a single core".
Alternatively, you could use Dask, which partitions the data so that work on it can run in parallel. While it adds some overhead, it might be quicker for your needs.
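The "one task per function" idea could look roughly like this with concurrent.futures. The functions count_rows, total and broken and the helper run_all are invented for the illustration; the real functions would write their outputs to disk instead of returning values.

```python
from concurrent.futures import ThreadPoolExecutor


def count_rows(data):
    return len(data)


def total(data):
    return sum(data)


def broken(data):
    raise ValueError("this one fails, and that is fine")


def run_all(functions, data, max_workers=4):
    # Submit one task per function; each runs independently of the others.
    outcomes = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fn, data): fn.__name__ for fn in functions}
        for future, name in futures.items():
            try:
                outcomes[name] = ("ok", future.result())
            except Exception as exc:
                # A failure in one function does not affect the rest.
                outcomes[name] = ("failed", repr(exc))
    return outcomes


if __name__ == "__main__":
    print(run_all([count_rows, total, broken], [1, 2, 3]))
```

ProcessPoolExecutor offers the same submit interface, so swapping it in would spread the functions over several cores, at the cost of pickling the dataset once per task.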
I was in a similar situation as yours, and used Processes with the following function:
import multiprocessing as mp
def launch_proc(nproc, lst_functions, lst_args, lst_kwargs):
    n = len(lst_functions)
    r = 1 if n % nproc > 0 else 0
    for b in range(n//nproc + r):
        bucket = []
        for p in range(nproc):
            i = b*nproc + p
            if i == n:
                break
            proc = mp.Process(target=lst_functions[i], args=lst_args[i], kwargs=lst_kwargs[i])
            bucket.append(proc)
        for proc in bucket:
            proc.start()
        for proc in bucket:
            proc.join()
This has a major drawback: all Processes in a bucket have to finish before a new bucket can start. I tried to use a JoinableQueue to avoid this, but could not make it work.
Example:
def f(i):
    print(i)
nproc = 2
n = 11
lst_f = [f] * n
lst_args = [[i] for i in range(n)]
lst_kwargs = [{}] * n
launch_proc(nproc, lst_f, lst_args, lst_kwargs)
Hope it can help.
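One possible way around the bucket drawback, sketched under the assumption that a multiprocessing.Pool is acceptable in place of hand-made Process objects: queue every task with apply_async and let the pool hand out work as workers become free. launch_pool is a hypothetical name, and f returns a value here only so the result can be printed.

```python
import multiprocessing as mp


def f(i):
    return i * i


def launch_pool(pool, lst_functions, lst_args, lst_kwargs):
    # Queue every task at once; the pool gives a new task to a worker
    # as soon as that worker is free, so there are no buckets to wait on.
    pending = [
        pool.apply_async(fn, args, kwargs)
        for fn, args, kwargs in zip(lst_functions, lst_args, lst_kwargs)
    ]
    # get() waits for each task and re-raises any exception from the worker.
    return [result.get() for result in pending]


if __name__ == "__main__":
    n = 11
    with mp.Pool(2) as pool:
        print(launch_pool(pool, [f] * n, [[i] for i in range(n)], [{}] * n))
```

If some functions are allowed to fail, the get() calls would need a try/except around them.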

python for large data processing

I'm relatively new to Python and have been able to answer most of my questions based on similar problems answered on forums, but I'm at a point where I'm stuck and could use some help.
I have a simple nested for loop script that generates an output of strings. What I need to do next is have each grouping go through a simulation, based on numerical values that the strings will be matched to.
Really, my question is: how do I go about this in the best way? I'm not sure if multithreading will work, since the strings are generated and then need to undergo the simulation one set at a time. I was reading about queues and wasn't sure if the strings could be passed into a queue for storage and then undergo the simulation in the same order they entered the queue.
Regardless of the research I've done, I'm open to any suggestion anyone can make on the matter.
Thanks!
Edit: I'm not looking for an answer on how to do the simulation, but rather a way to store the combinations while simulations are being computed.
example
import itertools
X = ["a","b"]
Y = ["c","d","e"]
Z = ["f","g"]
for A in itertools.combinations(X,1):
    for B in itertools.combinations(Y,2):
        for C in itertools.combinations(Z, 2):
            D = A + B + C
            print(D)
As was hinted at in the comments, the multiprocessing module is what you're looking for. Threading won't help you because of the Global Interpreter Lock (GIL), which limits execution to one Python thread at a time. In particular, I would look at multiprocessing pools. These objects give you an interface to have a pool of subprocesses do work for you in parallel with the main process, and you can go back and check on the results later.
Your example snippet could look something like this:
import itertools
import multiprocessing
X = ["a","b"]
Y = ["c","d","e"]
Z = ["f","g"]
pool = multiprocessing.Pool() # by default, this will create a number of workers equal to
                              # the number of CPU cores you have available
combination_list = [] # create a list to store the combinations
for A in itertools.combinations(X,1):
    for B in itertools.combinations(Y,2):
        for C in itertools.combinations(Z, 2):
            D = A + B + C
            combination_list.append(D) # append this combination to the list
results = pool.map(simulation_function, combination_list)
# simulation_function is the function you're using to actually run your
# simulation - assuming it only takes one parameter: the combination
The call to pool.map is blocking - meaning that once you call it, execution in the main process will halt until all the simulations are complete, but it is running them in parallel. When they complete, whatever your simulation function returns will be available in results, in the same order that the input arguments were in the combination_list.
If you don't want to wait for them, you can also use apply_async on your pool and store the result to look at later:
import itertools
import multiprocessing
X = ["a","b"]
Y = ["c","d","e"]
Z = ["f","g"]
pool = multiprocessing.Pool()
result_list = [] # create a list to store the simulation results
for A in itertools.combinations(X,1):
    for B in itertools.combinations(Y,2):
        for C in itertools.combinations(Z, 2):
            D = A + B + C
            result_list.append(pool.apply_async(
                simulation_function,
                args=(D,))) # note the extra comma - args must be a tuple
# do other stuff
# now iterate over result_list to check the results when they're ready
If you use this structure, result_list will be full of multiprocessing.pool.AsyncResult objects, which allow you to check whether they are ready with result.ready() and retrieve the result with result.get(). This approach will cause the simulations to be kicked off right when the combination is calculated, instead of waiting until all of them have been calculated to start processing them. The downside is that it's a little more complicated to manage and retrieve the results. For example, result.get() blocks until the result is ready unless you pass a timeout, in which case you must be ready to catch multiprocessing.TimeoutError, and it re-raises any exception that was raised in the worker function. The caveats are explained pretty well in the documentation.
If calculating the combinations doesn't actually take very long and you don't mind your main process halting until they're all ready, I suggest the pool.map approach.
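For the apply_async route, collecting the results could be sketched as below. The simulation_function body is a hypothetical stand-in that fails on purpose for one input, and collect is a made-up helper; the 10-second timeout is an arbitrary choice.

```python
import multiprocessing as mp


def simulation_function(combination):
    # Hypothetical stand-in: fails on purpose for one input.
    if "bad" in combination:
        raise ValueError("simulation failed for %r" % (combination,))
    return len(combination)


def collect(result_list, timeout=10):
    # Turn AsyncResult objects into (status, payload) pairs.
    collected = []
    for result in result_list:
        try:
            collected.append(("ok", result.get(timeout=timeout)))
        except mp.TimeoutError:
            collected.append(("timeout", None))
        except Exception as exc:  # re-raised from the worker by get()
            collected.append(("error", str(exc)))
    return collected


if __name__ == "__main__":
    combos = [("a", "c", "d"), ("bad",), ("b", "e")]
    with mp.Pool(2) as pool:
        result_list = [
            pool.apply_async(simulation_function, args=(c,)) for c in combos
        ]
        print(collect(result_list))
```

The results come back in submission order because collect walks result_list in order, regardless of which simulation finished first.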

Python3 sharing an array between parent/child processes

https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Array
What I’m trying to do
Create an array in MainProcess and send it through inheritance to any subsequent child processes. The child processes will change the array. The parent process will look out for the changes and act accordingly.
The problem
The parent process does not "see" any changes done by the child processes. However, the child processes do "see" the changes, i.e. if child 1 adds an item, then child 2 will see that item, etc.
This is true for sARRAY and iARRAY, and iVALUE.
BUT
While the parent process seems to be oblivious to the array values it does take notice of the changes done to the iVALUE.
I don’t understand what I’m doing wrong.
UPDATE 2 https://stackoverflow.com/a/6455704/1267259
The main source of confusion is that multiprocessing uses separate processes and not threads. This means that any changes to object state
made by the children aren't automatically visible to the parent.
To clarify. What I want to do is possible, right?
https://stackoverflow.com/a/26554759/1267259
I mean that's the purpose with multiprocessing Array and Value, to communicate between children and parent processes? And iVALUE works so...
I’ve found this Shared Array not shared correctly in python multiprocessing
But I don’t understand the answer "Assigning to values that have meaning in all processes seems to help:"
UPDATE 1
Found
Python : multiprocessing and Array of c_char_p
> "the assignment to arr[i] points arr[i] to a memory address that was
only meaningful to the subprocess making the assignment. The other
subprocesses retrieve garbage when looking at that address."
As I understand it this doesn't apply to this problem. The assignment
by one subprocess to the array does make sense to the other
subprocesses in this case. But why doesn't it make sense for the main
process?
And I am aware of "managers" but it feels like Array should suffice for this use case. I've read the manual but obviously I don't seem to get it.
UPDATE 3 Indeed, this works
manage = multiprocessing.Manager()
shared_list = manage.list(range(3))
So...
What am I doing wrong?
import multiprocessing
import ctypes
class MainProcess:
    # keep track of process
    iVALUE = multiprocessing.Value('i',-1) # this works
    # keep track of items
    sARRAY = multiprocessing.Array(ctypes.c_wchar_p, 1024) # this works between child processes
    iARRAY = multiprocessing.Array(ctypes.c_int, 3) # this works between child processes
    pLOCK = multiprocessing.Lock()
    def __init__(self):
        # create an index for each process
        self.sARRAY.value = [None] * 3
        self.iARRAY.value = [None] * 3
    def InitProcess(self):
        # list of items to process
        items = []
        item = (i for i in items)
        with(multiprocessing.Pool(3)) as pool:
            # main loop: keep looking for updated values
            while True:
                try:
                    pool.apply_async(self.worker, (next(item),), callback=eat_finished_cake)
                except StopIteration:
                    pass
                print(self.sARRAY) # yields [None][None][None]
                print(self.iARRAY) # yields [None][None][None]
                print(self.iVALUE) # yields 1-3
            pool.close()
            pool.join()
    def worker(self,item):
        with self.pLOCK:
            self.iVALUE.value += 1
            self.sARRAY.value[self.iVALUE.value] = item # value: 'item 1'
            self.iARRAY.value[self.iVALUE.value] = 2
        # on next child process run
        print(self.iVALUE.value) # prints 1
        print(self.sARRAY.value) # prints ['item 1'][None][None]
        print(self.iARRAY.value) # prints [2][None][None]
        sleep(0.5)
        ...
        with self.pLOCK:
            self.iVALUE.value -= 1
UPDATE 4
Changing
pool.apply_async(self.worker, (next(item),))
To
x = pool.apply_async(self.worker, (next(item),))
print(x.get())
Or
x = pool.apply(self.worker, (next(item),))
print(x)
And having self.worker() return self.iARRAY.value or self.sARRAY.value does return a variable that has the updated value. This is not what I want to achieve though; it doesn't even require the use of Array to achieve...
So I need to clarify. In self.worker() I'm doing important heavy lifting that can take a long time, and I need to send information back to the main process, e.g. the progress, before the return value is ready to be sent to the callback.
I don't expect the finished result to be returned to the main method; that is to be handled by the callback function. I see now that omitting the callback in the code example could have given a different impression, sorry.
UPDATE 5
Re: Use numpy array in shared memory for multiprocessing
I've seen that answer and tried a variation of it using initializer() with a global variable, passing the array through initargs, with no luck. I don't understand the use of numpy together with "closing()", and that code doesn't seem to access "arr" inside main() although shared_arr is used, but only after p.join().
As far as I can see, the array is declared, then turned into a numpy array and inherited through init(x). My code should have the same behavior as that code so far.
One major difference seems to be how the array is accessed
I've only succeeded setting and getting array value using the attribute value, when I tried
self.iARRAY[0] = 1 # instead of iARRAY.value = [None] * 3
self.iARRAY[1] = 1
self.iARRAY[2] = 1
print(self.iARRAY) # prints <SynchronizedArray wrapper for <multiprocessing.sharedctypes.c_int_Array_3 object at 0x7f9cfa8538c8>>
And I can't find a method to access and check the values (the attribute "value" gives an unknown method error)
Another major difference from that code is the prevention of data copying using the get_obj().
Isn't this a numpy issue?
assert np.allclose(((-1)**M)*tonumpyarray(shared_arr), arr_orig)
Not sure how to make use of that.
def worker(self,item):
    with self.pLOCK:
        self.iVALUE.value += 1
        self.sARRAY.value[self.iVALUE.value] = item # value: 'item 1'
        with self.iARRAY.get_lock():
            arr = self.iARRAY.get_obj()
            arr[self.iVALUE.value] = 2 # and now ???
    sleep(0.5)
    ...
    with self.pLOCK:
        self.iVALUE.value -= 1
UPDATE 6
I've tried using multiprocessing.Process() instead of Pool() but the result is the same.
The correct way to declare the global variable (in this case a class attribute):
iARRAY = multiprocessing.Array(ctypes.c_int, range(3))
The correct way to set a value:
self.iARRAY[n] = x
The correct way to get a value:
self.iARRAY[n]
I'm not sure why the examples I've seen used Array(ctypes.c_int, 3) and iARRAY.value[n], but in this case that was wrong.
This is your problem:
while True:
    try:
        pool.apply_async(self.worker, (next(item),))
    except StopIteration:
        pass
    print(self.sARRAY) # yields [None][None][None]
    print(self.iARRAY) # yields [None][None][None]
    print(self.iVALUE) # yields 1-3
The function pool.apply_async() starts the subprocess running and returns immediately. You don't appear to be waiting for the workers to finish. For that, you might consider using a barrier.
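For reference, here is a minimal sketch of a parent watching a shared Array while children update it, using Process rather than Pool. The names record_progress and worker and the sleep intervals are illustrative choices, not taken from the question.

```python
import ctypes
import multiprocessing as mp
import time


def record_progress(progress, slot, value):
    # Index the Array directly; an int Array has no .value attribute.
    with progress.get_lock():
        progress[slot] = value


def worker(progress, slot):
    for step in range(1, 4):
        time.sleep(0.1)  # stand-in for the heavy lifting
        record_progress(progress, slot, step)


if __name__ == "__main__":
    progress = mp.Array(ctypes.c_int, 3)  # three ints, initialised to 0
    procs = [mp.Process(target=worker, args=(progress, n)) for n in range(3)]
    for proc in procs:
        proc.start()
    while any(proc.is_alive() for proc in procs):
        print(progress[:])  # the parent sees the children's updates
        time.sleep(0.1)
    for proc in procs:
        proc.join()
    print(progress[:])
```

The Array is handed to each child as an argument, and the parent reads it with a slice instead of printing the wrapper object.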
