Python: Writing multiprocess shared list in background

I believe I am about to ask a definite newbie question, but here goes:
I've written a Python script that does SNMP queries. The SNMP query function uses a global list as its output.
def get_snmp(community_string, mac_ip):
    global output_list
    # ... SNMP GET happens here, producing output_string ...
    output_list.append(output_string)
The get_snmp queriers are launched using the following code:
pool.starmap_async(get_snmp, zip(itertools.repeat(COMMUNITY_STRING), input_list))
pool.close()
pool.join()
if output_file_name is not None:
    csv_writer(output_list, output_file_name)
This setup works fine: all of the get_snmp processes write their output to the shared list output_list, and then the csv_writer function is called and that list is dumped to disk.
The main issue with this program is that on a large run the memory usage can become quite high while the list is being built. I would like to write the results to the text file in the background to keep memory usage down, but I'm not sure how to do it. I went with the global list to eliminate file-locking issues.

I think your main problem with increasing memory usage is that you don't remove the contents of the list after writing them to the file.
Maybe you should do del output_list[:] after writing it to the file.

Have each of the workers write their output to a Queue, then have another worker (or the main thread) read from the Queue and write to a file. That way you don't have to store everything in memory.
Don't write directly to the file from the workers; otherwise you can have issues with multiple processes trying to write to the same file at the same time, which will just give you a headache until you fix it anyway.
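Here is a minimal sketch of that queue-based pattern, reusing the names from the question (get_snmp, COMMUNITY_STRING, input_list, a CSV output file); the SNMP call itself is stubbed out, and a managed Queue is used because a plain multiprocessing.Queue cannot be handed to pool workers as an argument:
import csv
import itertools
import multiprocessing as mp

# Hypothetical stand-ins for the question's COMMUNITY_STRING and input_list.
COMMUNITY_STRING = "public"
input_list = ["10.0.0.1", "10.0.0.2"]

def get_snmp(community_string, mac_ip, out_queue):
    # ... the real SNMP GET would happen here and build output_string ...
    output_string = [mac_ip, "placeholder-result"]
    out_queue.put(output_string)              # hand the row to the writer instead of a global list

def writer(out_queue, output_file_name):
    # The only process that ever touches the file, so no locking is needed.
    with open(output_file_name, "w", newline="") as f:
        csv_out = csv.writer(f)
        for row in iter(out_queue.get, None):  # None acts as the shutdown sentinel
            csv_out.writerow(row)

if __name__ == "__main__":
    manager = mp.Manager()
    queue = manager.Queue()                   # a managed Queue proxy can be passed to pool workers
    writer_proc = mp.Process(target=writer, args=(queue, "results.csv"))
    writer_proc.start()

    with mp.Pool() as pool:
        pool.starmap(get_snmp,
                     zip(itertools.repeat(COMMUNITY_STRING),
                         input_list,
                         itertools.repeat(queue)))

    queue.put(None)                           # tell the writer there is nothing more to come
    writer_proc.join()
Because the writer is the only process that opens the file, memory stays bounded by whatever happens to be sitting in the queue at any moment.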

Good practice for parallel tasks in python

I have one Python script which is generating data and one which is training a neural network with TensorFlow and Keras on this data. Both need an instance of the neural network.
Since I haven't set the flag "allow growth", each process takes the full GPU memory. Therefore I simply give each process its own GPU. (Maybe not a good solution for people with only one GPU... yet another unsolved problem.)
The actual problem is as follows: both instances need access to the network's weights file. I recently had a bunch of crashes because both processes tried to access the weights at the same time. A flag or something similar should stop each process from accessing the file while the other process is accessing it. Hopefully this doesn't create a bottleneck.
I tried to come up with a solution like semaphores in C, but today I found this post in stack-exchange.
The idea with renaming seems quite simple and effective to me. Is this good practice in my case? I'll just create the weight file with my own function
self.model.save_weights(filepath='weights.h5$$$')
in the learning process, rename them after saving with
os.rename('weights.h5$$$', 'weights.h5')
and load them in my data generating process with function
self.model.load_weights(filepath='weights.h5')
?
Will this renaming overwrite the old file? And what happens if the other process is currently loading? I would appreciate other ideas on how I could multithread/multiprocess my script. I just realized that generating data, learning, generating data, ... in a sequential script is not really performant.
EDIT 1: Forgot to mention that the weights are stored in a .h5 file by keras' save function
The multiprocessing module has an RLock class that you can use to regulate access to a shared resource. This also works for files if you remember to acquire the lock before reading and writing and to release it afterwards. Using a lock implies that some of the time one of the processes cannot read or write the file. How much of a problem this is depends on how much both processes have to access the file.
Note that for this to work, one of the scripts has to start the other script as a Process after creating the lock.
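A minimal sketch of that arrangement, assuming a single launcher script starts both workers; the Keras save/load calls are left as commented placeholders, since the real model object is not available here:
import multiprocessing as mp
import time

WEIGHTS_FILE = "weights.h5"   # hypothetical path, matching the question

def trainer(lock):
    for _ in range(3):
        # ... train for a while ...
        with lock:                             # nobody may load while we save
            # self.model.save_weights(filepath=WEIGHTS_FILE)   # real call goes here
            open(WEIGHTS_FILE, "wb").close()   # placeholder write
        time.sleep(0.5)

def generator(lock):
    for _ in range(3):
        with lock:                             # nobody may save while we load
            # self.model.load_weights(filepath=WEIGHTS_FILE)   # real call goes here
            pass
        # ... generate data with the freshly loaded weights ...
        time.sleep(0.5)

if __name__ == "__main__":
    lock = mp.RLock()                          # create the lock first ...
    procs = [mp.Process(target=trainer, args=(lock,)),
             mp.Process(target=generator, args=(lock,))]
    for p in procs:                            # ... then start both workers so they share it
        p.start()
    for p in procs:
        p.join()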
If the weights are a Python data structure, you could put that under control of a multiprocessing.Manager. That will manage access to the objects under its control for you. Note that a Manager is not meant for use with files, just in-memory objects.
Additionally, on UNIX-like operating systems Python has os.lockf to lock (part of) a file. Note that this is an advisory lock only: if another process calls lockf on a region that is already locked, it is told that the file is locked, but nothing actually prevents it from reading the file anyway.
Note:
A file can be opened by two processes at once. Only when both processes are reading the same file (read/read) does this work well. Every other combination (read/write, write/read, write/write) can and eventually will result in undefined behavior and data corruption.
Note 2:
Another possible solution involves inter-process communication.
Process 1 writes a new .h5 file (with a random filename), closes it, and then sends a message to Process 2 (using a Pipe or Queue): "I've written a new parameter file \path\to\file".
Process 2 then reads the file and deletes it. This can work both ways, but it requires that both processes check for and handle messages every so often. It prevents file corruption because the writing process only notifies the reading process after it has finished writing the file.
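A rough sketch of that message-passing scheme with a Queue; the filenames, the loop counts, and the commented Keras calls are placeholders:
import multiprocessing as mp
import os
import uuid

def process_1(queue):
    for _ in range(3):
        # ... one round of training ...
        fname = "weights-%s.h5" % uuid.uuid4().hex    # random, never-reused filename
        # self.model.save_weights(filepath=fname)     # real save would go here
        open(fname, "wb").close()                     # placeholder write
        queue.put(fname)                              # announce the file only once it is complete
    queue.put(None)                                   # sentinel: no more files coming

def process_2(queue):
    for fname in iter(queue.get, None):
        # self.model.load_weights(filepath=fname)     # real load would go here
        os.remove(fname)                              # delete once consumed

if __name__ == "__main__":
    q = mp.Queue()
    p1 = mp.Process(target=process_1, args=(q,))
    p2 = mp.Process(target=process_2, args=(q,))
    p1.start(); p2.start()
    p1.join(); p2.join()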

Spider/Scraper in Python, insert data from multiprocessing

So I'm writing a small spider/scraper in Python that fetches and analyses different URLs using multiple processes.
My question is: how should I insert the data gathered in the previous step?
Spawn a thread from each process? Add the data to a global object and insert it into the database afterwards? Other options?
Thank you.
One way is to dump the results from each worker to a .csv file in append mode. You can protect your file using a context manager. This way you won't lose any data in case your system stops execution for whatever reason, because all results are saved the moment they become available.
I recommend using the with statement, whose primary purpose is exception-safe cleanup of the object used inside (in this case your .csv file). In other words, with makes sure that files are closed, locks are released, contexts are restored, and so on.
with open("myfile.csv", "a") as reference: # Drop to csv w/ context manager
df.to_csv(reference, sep = ",", index = False)
# As soon as you are here, reference is closed
My present humble opinion is to use Pool; for a small spider, Pool is enough.
Here is an example:
from multiprocessing.pool import Pool
pool = Pool(20)
pool.map(main, urls)  # encapsulate the original functions into a main() function and pass in the urls
pool.close()
pool.join()
This is the source code.
P.S. This is my first answer; I would be glad if it was helpful.

Python: Efficient way to move a huge list of data from one list to another

I'm using multiprocessing, and in a separate process (let's call it thread for clarity), I'm using PySerial to pull data from a device. In my program, I have a list that's shared with the main thread. I create the list using:
import multiprocessing as mp
#... stuff
self.mpManager = mp.Manager()
self.shared_return_list = self.mpManager.list()
This list is filled inside the process, and then its contents are transferred to the local thread using:
if len(self.acqBuffer) < 50000:
    try:
        self.shared_result_lock.acquire()
        self.acqBuffer.extend(self.shared_return_list)
        del self.shared_return_list[:]
    finally:
        self.shared_result_lock.release()
where acqBuffer is the local list that takes the data for analysis and storage.
The problem is that if the device has a lot of data queued, the transfer becomes really, really slow because so much data is there, and the GUI freezes. Possible solutions are either to transfer the data in chunks while actively keeping the GUI alive, or to find a smarter way to transfer the data, which is what I'm asking about.
In C++, I would use some derivative of std::deque or std::list (which is not necessarily contiguous in memory) with a move constructor and use std::move to just push the pointer to the data, instead of re-copying the whole data to the main thread. Is such a thing possible in my case in Python? Are there smarter ways to do this?
Thank you.
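One way to act on the chunked-transfer idea mentioned in the question is to move a bounded number of items per call and let a GUI timer schedule repeated calls. A minimal sketch, written as a method of the question's class and reusing its attribute names; drain_chunk and CHUNK are hypothetical names:
CHUNK = 5000   # hypothetical batch size; tune it so a single transfer stays well under a GUI frame

def drain_chunk(self):
    """Move at most CHUNK items from the shared list into the local buffer."""
    with self.shared_result_lock:                 # same lock as in the original snippet
        chunk = self.shared_return_list[:CHUNK]   # one round-trip to the Manager process
        del self.shared_return_list[:CHUNK]       # drop exactly what was copied
    self.acqBuffer.extend(chunk)                  # extend the local list outside the lock
    return len(chunk)                             # non-zero means there may be more to fetch
Each call copies one bounded slice out of the Manager list, so no single call can stall the GUI for long; the caller keeps rescheduling as long as the return value is non-zero.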

CPU (all cores) become idle during python multiprocessing on windows

My system is Windows 7. I wrote a Python program to do data analysis, and I use the multiprocessing library to achieve parallelism. When I open Windows PowerShell and type python MyScript.py, it starts to use all the CPU cores. But after a while, the CPU (all cores) becomes idle. If I hit Enter in the PowerShell window, all cores go back to full load. To be clear, the program is fine and has been tested; the problem here is that the CPU cores go idle by themselves.
This happened not only on my office computer, which runs Windows 7 Pro, but also on my home desktop, which runs Windows 7 Ultimate.
The parallel part of the program is very simple:
def myfunc(input):
    ## some operations based on a huge data set and a small data set ##
    # operation 1: read in a piece of HugeData (query-based HDF5)
    # operation 2: some operation based on HugeData and SmallData
    return output

# read in the small data set
SmallData = pd.read_csv('data.csv')

if __name__ == '__main__':
    pool = mp.Pool()
    result = pool.map_async(myfunc, a_list_of_input)
    out = result.get()
My functions are mainly data manipulations using pandas.
There is nothing wrong with the program, because I've successfully finished runs a couple of times. But I have to keep watching it and hit Enter when the cores become idle. The job takes a couple of hours, and I really can't keep watching it.
Is this a problem with Windows itself or with my program?
By the way, can all the cores have access to the same variable stored in memory? For example, I have a data set mydata read into memory right before if __name__ == '__main__':. This data will be used in myfunc. All the cores should be able to access mydata at the same time, right?
Please help!
I was redirected to this question because I was facing a similar problem while using Python's multiprocessing library on Ubuntu. In my case, the processes do not resume when I hit Enter or anything like that; they start again abruptly after some time. My code is an iterative heuristic that uses multiprocessing in each of its iterations, and I have to rerun it after some iterations complete in order to get steady runtime performance. As the question was posted quite a while back, did you ever find the actual reason behind it and a solution?
I confess to not understanding the subtleties of map_async, but I'm not sure whether you can use it like that (I can't seem to get it to work at all)...
I usually use the following recipe (a list comprehension of the calls I want done):
procs = [multiprocessing.Process(target=f, args=()) for _ in xrange(4)]
for p in procs: p.start()
for p in procs: p.join()
It's simple and waits until the jobs are finished before continuing.
This works fine with pandas objects provided you're not doing modifications... (I think) copies of the object are passed to each thread, and if you perform mutations they will not propagate back and will be garbage collected.
You can use multiprocessing's version of a dict or list with the Manager class; this is useful for storing the result of each job (simply access the dict/list from within the function):
mgr = multiprocessing.Manager()
d = mgr.dict()
L = mgr.list()
and they will have shared access (as if you had used a Lock). It's hardly worth mentioning, but if you are appending to a list then the order will not necessarily be the same as the order of procs!
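A small, self-contained sketch of what "access the dict/list from within the function" can look like; f here is just a placeholder worker:
import multiprocessing

def f(d, L, i):
    # each job stores its result under its own key and also appends to the shared list
    d[i] = i * i
    L.append(i * i)

if __name__ == "__main__":
    mgr = multiprocessing.Manager()
    d = mgr.dict()
    L = mgr.list()
    procs = [multiprocessing.Process(target=f, args=(d, L, i)) for i in range(4)]
    for p in procs: p.start()
    for p in procs: p.join()
    print(dict(d))    # keyed results are order-independent
    print(list(L))    # append order depends on scheduling, not on the order of procs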
You may be able to do something similar to the Manager for pandas objects (writing a lock to objects in memory without copying), but I think this would be a non-trivial task...

Slower Execution with Python multiprocessing

I may be a little in over my head here, but I am working on a little bioinformatics project in Python. I am trying to parallelize a program that analyzes a large dictionary of sets of strings (~2-3 GB in RAM). I find that the multiprocessing version is faster when I have smaller dictionaries, but it is of little benefit and mostly slower with the large ones. My first theory was that running out of memory just slowed everything down and the bottleneck was swapping into virtual memory. However, I ran the program on a cluster with 4*48 GB of RAM and the same slowdown occurred. My second theory is that access to certain data was being locked. If one thread is trying to access a reference currently being accessed in another thread, will that thread have to wait? I have tried creating copies of the dictionaries I want to manipulate, but that seems terribly inefficient. What else could be causing my problems?
My multiprocessing method is below:
def worker(seqDict, oQueue):
    # do stuff with the given partial dictionary
    oQueue.put(seqDict)

oQueue = multiprocessing.Queue()
chunksize = int(math.ceil(len(sdict)/4))  # 4 cores
inDict = {}
i = 0
dicts = list()
for key in sdict.keys():
    i += 1
    if len(sdict[key]) > 0:
        inDict[key] = sdict[key]
    if i % chunksize == 0 or i == len(sdict.keys()):
        print(str(len(inDict.keys())) + ", size")
        dicts.append(copy(inDict))
        inDict.clear()

for pdict in dicts:
    p = multiprocessing.Process(target=worker, args=(pdict, oQueue))
    p.start()

finalDict = {}
for i in range(4):
    finalDict.update(oQueue.get())

return finalDict
As I said in the comments, and as Kinch said in his answer, everything passed through to a subprocess has to be pickled and unpickled to duplicate it in the local context of the spawned process. If you use multiprocessing.Manager.dict for sdict (thereby allowing processes to share the same data through a server process that proxies the objects created on it) and spawn the processes with slice indices into that shared sdict, that should cut down on the serialize/deserialize sequence involved in spawning the child processes. You still might hit bottlenecks, though, in the server communication step of working with the shared objects. If so, you'll have to look at simplifying your data so you can use true shared memory with multiprocessing.Array or multiprocessing.Value, or look at multiprocessing.sharedctypes to create custom data structures to share between your processes.
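One way to read that suggestion, sketched here with a toy dictionary and with each child handed a chunk of keys into the shared dict; the worker body is a placeholder for the real analysis, and results come back through a Queue:
import multiprocessing

def worker(shared_dict, keys, out_queue):
    # Only the (small) key list is pickled into the child; the big values are
    # fetched through the manager proxy on demand.
    partial = {}
    for key in keys:
        value = shared_dict[key]        # one proxy round-trip per lookup
        partial[key] = len(value)       # placeholder for the real per-set analysis
    out_queue.put(partial)

if __name__ == "__main__":
    # Toy stand-in for the real "large dictionary of sets of strings".
    sdict = {"a": {"x", "y"}, "b": {"z"}, "c": set(), "d": {"w", "v"}}

    mgr = multiprocessing.Manager()
    shared = mgr.dict(sdict)            # one server-side copy instead of one copy per child
    out_queue = multiprocessing.Queue()

    keys = [k for k in sdict if sdict[k]]
    n_workers = 2
    chunks = [keys[i::n_workers] for i in range(n_workers)]

    procs = [multiprocessing.Process(target=worker, args=(shared, chunk, out_queue))
             for chunk in chunks]
    for p in procs:
        p.start()
    finalDict = {}
    for _ in procs:
        finalDict.update(out_queue.get())   # drain before join to avoid a full-pipe deadlock
    for p in procs:
        p.join()
    print(finalDict)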
It seems like the data from the "large dictionary of sets of strings" could be reformatted into something that could be stored in a file or a string, allowing you to use the mmap module to share it among all the processes. Each process might incur some startup overhead if it needs to convert the data back into some other, more preferable form, but that could be minimized by passing each process something indicating which subset of the whole dataset in shared memory it should work on, and by only reconstituting the part required by that process.
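A rough sketch of that mmap idea, using a toy file of newline-separated records as a stand-in for the reformatted data; each worker maps the file read-only and touches only its own byte range:
import mmap
import multiprocessing
import os

DATA_FILE = "records.txt"   # hypothetical: the reformatted data set, one record per line

def worker(start, end, out_queue):
    # Every process maps the same file read-only; the OS shares the pages between
    # them, so the data is not duplicated per process.
    with open(DATA_FILE, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        chunk = mm[start:end]                 # reconstitute only this worker's byte range
        out_queue.put(chunk.count(b"\n"))     # placeholder for the real analysis
        mm.close()

if __name__ == "__main__":
    # Toy stand-in for the reformatted data set.
    with open(DATA_FILE, "wb") as f:
        f.write(b"\n".join(b"record%d" % i for i in range(1000)) + b"\n")

    size = os.path.getsize(DATA_FILE)
    step = size // 4
    bounds = [(i * step, size if i == 3 else (i + 1) * step) for i in range(4)]

    q = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(lo, hi, q)) for lo, hi in bounds]
    for p in procs:
        p.start()
    print(sum(q.get() for _ in procs))        # total records seen across all chunks
    for p in procs:
        p.join()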
All data which is passed through the queue will be serialized and deserialized using pickle. I would guess this could be a bottleneck if you pass a lot of data around.
You could reduce the amount of data, make use of shared memory, write a multi-threaded version in a C extension, or try a multi-threaded version of this with a thread-safe implementation of Python (maybe Jython or PyPy; I don't know).
Oh and by the way: You are using multiprocessing and not multithreading.
