How to save memory while using multiprocessing in Python?

I've got a function that takes a node id of a graph as input, calculates something in the graph (without altering the graph object), and then saves the results to the filesystem. My code looks like this:
...
# graph file is being loaded
g = loadGraph(gfile='data/graph.txt')
# list of nodeids is being loaded
nodeids = loadSeeds(sfile='data/seeds.txt')
import multiprocessing as mp
# parallel part of the code
print ("entering the parallel part ..")
num_workers = mp.cpu_count() # 4 on my machine
p = mp.Pool(num_workers)
# _myParallelFunction(nodeid) {calculate something for nodeid in g and save it into a file}
p.map(_myParallelFunction, nodeids)
p.close()
...
The problem is that when I load the graph into Python it takes a lot of memory (about 2 GB; it's a big graph with thousands of nodes), but when execution enters the parallel part of the code (the parallel map call) each process seems to get its own separate copy of g, and I simply run out of memory on my machine (it has 6 GB of RAM and 3 GB of swap). Is there a way to give every process the same copy of g, so that only the memory to hold one copy of it is required? Any suggestions are appreciated, and thanks in advance.

If dividing the graph into smaller parts does not work, you may be able to find a solution using a shared-memory structure such as multiprocessing.sharedctypes, depending on what kind of object your graph is.
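If the graph's data can be flattened into a numeric buffer, a minimal sharedctypes sketch could look like the following (this assumes the fork start method, where the module-level array is inherited by the worker processes; the array contents are only an illustration, not the question's graph format):

import multiprocessing as mp
from multiprocessing import sharedctypes
import numpy as np

# One shared, lock-free buffer created before the pool; with the fork start
# method (the default on Linux) the workers inherit it without copying.
shared_weights = sharedctypes.RawArray('d', 1000)
np.frombuffer(shared_weights, dtype=np.float64)[:] = np.random.rand(1000)

def worker(nodeid):
    # Read-only view onto the same memory in every worker process.
    view = np.frombuffer(shared_weights, dtype=np.float64)
    return view[nodeid]

if __name__ == '__main__':
    with mp.Pool(4) as pool:
        print(pool.map(worker, range(10)))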

Your comment indicates that you are processing a single node at a time:
# _myParallelFunction(nodeid) {calculate something for nodeid in g and save it into a file}
I would create a generator function that yields a single node id each time it is asked for one, and pass that generator to the pool instead of the entire list of nodeids, as sketched below.
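A minimal sketch of that idea, assuming the seeds file has one node id per line and using a stub in place of the question's worker. Note that Pool.map consumes its whole input up front, so Pool.imap is used here so the generator is actually read lazily:

import multiprocessing as mp

def node_ids(sfile='data/seeds.txt'):
    # Yield one node id at a time instead of holding the whole list in memory.
    with open(sfile) as f:
        for line in f:
            yield line.strip()

def _myParallelFunction(nodeid):
    # Stub standing in for the question's worker, which calculates something
    # for nodeid in g and saves it to a file.
    return nodeid

if __name__ == '__main__':
    p = mp.Pool(mp.cpu_count())
    # imap pulls node ids from the generator lazily, in chunks.
    for _ in p.imap(_myParallelFunction, node_ids(), chunksize=100):
        pass
    p.close()
    p.join()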

Related

PBS vmem exceeded limit: how can I find out where the memory limit is exceeded?

I have a shapefile with 1,500,000 polygons; I need to go through each polygon and intersect it with different grids.
I created a simple program that goes from polygon to polygon and performs the intersection (with multiprocessing):
pool = mp.Pool()
args = []
for index, pol in shapefile.iterrows():
    # limits of each polygon in the shapefile
    ylat = lat_gridlimits
    xlon = lon_gridlimits
    args.append((dgrid, ylat, xlon, pol, index))
pool.starmap(calculate, args)
pool.close()
pool.join()
but memory fills up very quickly and I get an error:
PBS: job killed: vmem exceeded limit
How can I find out where or when the memory limit is exceeded?
Or is there a way to control the memory used in each function?
I tried this (inside calculate):
import os
import psutil

process = psutil.Process(os.getpid())
mem = process.memory_info().rss / (1024.0 ** 3)       # resident set size, in GiB
vmem = psutil.virtual_memory().total / (1024.0 ** 3)  # total system memory, in GiB
print("{} {}\n".format(mem, vmem))
but it doesn't help me locate where the memory is being exhausted.
One reason you are running out of memory might be that you iterate over a very large dataset in your for loop and store everything it produces. Holding all of it may take more memory than the Python program is allowed to use on your system. One way to save memory is to wrap the work done around shapefile.iterrows() in a generator function, since a generator produces each value on demand rather than storing them all at once (a sketch follows the link below).
To read more about generators visit following link:
https://pythongeeks.org/python-generators-with-examples/
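A sketch of that idea applied to the question's loop; calculate and the grid objects here are placeholders standing in for the question's real ones. The argument tuples are produced lazily by a generator and consumed by imap_unordered, so the full args list never has to exist in memory:

import multiprocessing as mp

def calculate(args):
    # Placeholder for the question's per-polygon intersection work.
    dgrid, ylat, xlon, pol, index = args
    return index

def arg_gen(rows, dgrid, lat_gridlimits, lon_gridlimits):
    # Yield one argument tuple at a time instead of appending them all to a list.
    for index, pol in rows:
        yield (dgrid, lat_gridlimits, lon_gridlimits, pol, index)

if __name__ == "__main__":
    rows = enumerate(range(1000))  # stands in for shapefile.iterrows()
    with mp.Pool() as pool:
        args = arg_gen(rows, None, None, None)
        for _ in pool.imap_unordered(calculate, args, chunksize=100):
            pass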

python3 multiprocess shared numpy array (read-only)

I'm not sure this title fits my situation: the reason I want to share a numpy array is that it might be one potential solution to my case, but if you have other solutions that would also be nice.
My task: I need to implement an iterative algorithm with multiprocessing, where each process needs to have a copy of the data (this data is large, read-only, and won't change during the iterative algorithm).
I've written some pseudo code to demonstrate my idea:
import multiprocessing

def worker_func(data, args):
    # do sth with data and this chunk of args ...
    return res

def compute(data, process_num, niter):
    result = []
    args = init()
    for it in range(niter):
        args_chunk = split_args(args, process_num)
        pool = multiprocessing.Pool()
        for i in range(process_num):
            result.append(pool.apply_async(worker_func, (data, args_chunk[i])))
        pool.close()
        pool.join()
        # aggregate results and update args
        for res in result:
            args = update_args(res.get())

if __name__ == "__main__":
    # data would be loaded/constructed here in the real code
    compute(data, 4, 100)
The problem is that in each iteration I have to pass the data to the subprocesses, which is very time-consuming.
I've come up with two potential solutions:
Share the data among processes (it's an ndarray); that's the title of this question.
Keep the subprocesses alive, like daemon processes, and have them wait for calls. That way I would only need to pass the data once, at the very beginning.
So, is there any way to share a read-only numpy array among processes? Or, if you have a good implementation of solution 2, that works too.
Thanks in advance.
If you absolutely must use Python multiprocessing, then you can use it along with Arrow's Plasma object store to keep the object in shared memory and access it from each of the workers. See this example, which does the same thing using a Pandas dataframe instead of a numpy array.
If you don't absolutely need to use Python multiprocessing, you can do this much more easily with Ray. One advantage of Ray is that it will work out of the box not just with arrays but also with Python objects that contain arrays.
Under the hood, Ray serializes Python objects using Apache Arrow, which is a zero-copy data layout, and stores the result in Arrow's Plasma object store. This allows worker tasks to have read-only access to the objects without creating their own copies. You can read more about how this works.
Here is a modified version of your example that runs.
import numpy as np
import ray

ray.init()

@ray.remote
def worker_func(data, i):
    # Do work. This function will have read-only access to
    # the data array.
    return 0

data = np.zeros(10**7)

# Store the large array in shared memory once so that it can be accessed
# by the worker tasks without creating copies.
data_id = ray.put(data)

# Run worker_func 10 times in parallel. This will not create any copies
# of the array. The tasks will run in separate processes.
result_ids = []
for i in range(10):
    result_ids.append(worker_func.remote(data_id, i))

# Get the results.
results = ray.get(result_ids)
Note that if we omitted the line data_id = ray.put(data) and instead called worker_func.remote(data, i), then the data array would be stored in shared memory once per function call, which would be inefficient. By first calling ray.put, we can store the object in the object store a single time.
Conceptually, using mmap is a standard way to approach your problem: the information can be retrieved from mapped memory by multiple processes without each of them holding its own copy.
For a basic understanding of mmap see:
https://en.wikipedia.org/wiki/Mmap
Python has an mmap module (import mmap); the standard library documentation and some examples are at the link below:
https://docs.python.org/2/library/mmap.html
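As a minimal sketch of that idea (assuming the array can live in a file on disk and the workers only need read access; the file name is made up for illustration), the parent saves the array once and each worker maps it with numpy's mmap_mode, so the operating system shares the pages between processes instead of copying them:

import multiprocessing as mp
import numpy as np

ARRAY_PATH = "shared_data.npy"   # hypothetical scratch file

def worker_func(i):
    # Each worker maps the same file read-only; the OS page cache is shared,
    # so no per-process copy of the full array is made.
    data = np.load(ARRAY_PATH, mmap_mode="r")
    return float(data[i])        # toy "work": read one element

if __name__ == "__main__":
    data = np.zeros(10**7)
    np.save(ARRAY_PATH, data)    # write the array to disk once
    with mp.Pool(4) as pool:
        results = pool.map(worker_func, range(10))
    print(results)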

Dump intermediate results of multiprocessing job to filesystem and continue with processing later on

I have a job that uses the multiprocessing package and calls a function via
resultList = pool.map(myFunction, myListOfInputParameters).
Each entry of the list of input parameters is independent of the others.
This job will run for a couple of hours. For safety reasons, I would like to store the intermediate results at regular intervals, e.g. once an hour.
How can I do this so that, if the job is aborted and I restart it, processing can continue from the last available backup?
Perhaps use pickle. Read more here:
https://docs.python.org/3/library/pickle.html
Based on aws_apprentice's comment I created a full multiprocessing example in case you weren't sure how to use intermediate results. The first time this is run it will print "None" as there are no intermediate results. Run it again to simulate restarting.
from multiprocessing import Process
import pickle

def proc(name):
    data = None
    # Load intermediate results if they exist
    try:
        with open(name + '.pkl', 'rb') as f:
            data = pickle.load(f)
    except (OSError, pickle.UnpicklingError):
        pass
    # Do something
    print(data)
    data = "intermediate result for " + name
    # Periodically save your intermediate results
    with open(name + '.pkl', 'wb') as f:
        pickle.dump(data, f, -1)

if __name__ == '__main__':
    processes = []
    for x in range(5):
        p = Process(target=proc, args=("proc" + str(x),))
        p.daemon = True
        p.start()
        processes.append(p)
    for process in processes:
        process.join()
    for process in processes:
        process.terminate()
You can also use json if it makes sense to output the intermediate results in a human-readable format, or sqlite as a database if you need to push the data into rows.
There are at least two possible options.
Have each call of myFunction save its output into a uniquely named file. The file name should be based on or linked to the input data. Use the parent program to gather the results. In this case myFunction should return an identifier of the item that is finished.
Use imap_unordered instead of map. This will start yielding results as soon as they are available, instead of returning only when all processing is finished. Have the parent program save the returned data along with an indication of which items are finished (see the sketch at the end of this answer).
In both cases, the program would have to examine the data saved from previous runs to adjust myListOfInputParameters when it is being re-started.
Which option is best depends to a large degree on the amount of data returned by myFunction. If this is a large amount, there is a significant overhead associated with transferring it back to the parent. In that case option 1 is probably best.
Since writing to disk is relatively slow, calculations will probably go faster with option 2, and it is easier for the parent program to track progress.
Note that you can also use imap_unordered with option 1.
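A minimal sketch of option 2 along these lines; myFunction and myListOfInputParameters stand in for the question's names, and the checkpoint file is a made-up example. The parent consumes imap_unordered, records each finished input, saves periodically, and on restart skips inputs that already have results:

import pickle
from multiprocessing import Pool

CHECKPOINT = "results.pkl"   # hypothetical checkpoint file

def myFunction(item):
    # Placeholder for the real per-item work; returns (input, result) so the
    # parent knows which item has finished.
    return item, item * item

def load_checkpoint():
    try:
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    except OSError:
        return {}

def save_checkpoint(done):
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(done, f, -1)

if __name__ == "__main__":
    myListOfInputParameters = list(range(100))
    done = load_checkpoint()                          # results from a previous run
    todo = [x for x in myListOfInputParameters if x not in done]
    with Pool() as pool:
        for n, (item, result) in enumerate(pool.imap_unordered(myFunction, todo), 1):
            done[item] = result
            if n % 10 == 0:                           # checkpoint periodically
                save_checkpoint(done)
    save_checkpoint(done)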

System running out of memory when Python multiprocessing Pool is used?

I am trying to parallelize my code to compute a similarity matrix using the multiprocessing module in Python. It works fine when I use a small np.ndarray with 10 x 15 elements, but when I scale my np.ndarray up to 3613 x 7040 elements, the system runs out of memory.
Below, is my code.
import multiprocessing
from multiprocessing import Pool

## Importing the Jaccard similarity score
from sklearn.metrics import jaccard_similarity_score

# Function for finding the similarity between two np arrays
def similarityMetric(a, b):
    return jaccard_similarity_score(a, b)

## The functions below are used for parallelizing the script
# Auxiliary function to make it work
def product_helper1(args):
    return similarityMetric(*args)

def parallel_product1(list_a, list_b):
    # Spawn the given number of processes
    p = Pool(8)
    # Set each matching pair of items into a tuple
    job_args = getArguments(list_a, list_b)
    # Map over the pool
    results = p.map(product_helper1, job_args)
    p.close()
    p.join()
    return results

## getArguments is used to build the combined list of argument tuples
def getArguments(list_a, list_b):
    arguments = []
    for i in list_a:
        for j in list_b:
            arguments.append((i, j))
    return arguments
Now when I run the code below, the system runs out of memory and hangs. I am passing two numpy.ndarrays, testMatrix1 and testMatrix2, each of size (3613, 7040):
resultantMatrix = parallel_product1(testMatrix1,testMatrix2)
I am new to using this module in Python and trying to understand where I am going wrong. Any help is appreciated.
Odds are, the problem is just combinatoric explosion. You're trying to realize all the pairs in the main process up front rather than generating them lazily, so you're storing a huge amount of data in memory. Assuming the ndarrays contain double values, which become Python float, the memory usage of the list returned by getArguments is roughly the cost of a tuple and two floats per pair, or about:
3613 * 7040 * (sys.getsizeof((0., 0.)) + sys.getsizeof(0.) * 2)
On my 64 bit Linux system, that means ~2.65 GB of RAM on Py3, or ~2.85 GB on Py2, before the workers even do anything.
If you can process the data in a streaming fashion using a generator, so arguments are produced lazily and discarded when no longer needed, you could probably reduce memory usage dramatically:
import itertools

def parallel_product1(list_a, list_b):
    # Spawn the given number of processes
    p = Pool(8)
    # itertools.product returns a generator that lazily produces the argument tuples
    job_args = itertools.product(list_a, list_b)
    # Map over the pool
    results = p.map(product_helper1, job_args)
    p.close()
    p.join()
    return results
This still requires all the results to fit in memory; if product_helper returns floats, the expected memory usage for the result list on a 64 bit machine would still be around 0.75 GB, which is pretty large. If you can process the results in a streaming fashion, iterating over the results of p.imap, or better yet p.imap_unordered (the latter returns results as they are computed, not in the order the generator produced the arguments), and writing them to disk or otherwise ensuring they're released quickly, you would save a lot of memory. The following just prints them out, but writing them to a file in some reingestable format would also be reasonable.
def parallel_product1(list_a, list_b):
    # Spawn the given number of processes
    p = Pool(8)
    # itertools.product returns a generator that lazily produces the argument tuples
    job_args = itertools.product(list_a, list_b)
    # Stream the results instead of collecting them all in a list
    for result in p.imap_unordered(product_helper1, job_args):
        print(result)
    p.close()
    p.join()
The map method sends all data to the workers via inter-process communication. As currently done, this consumes a huge amount of resources, because you're sending two complete matrix rows for every single pair.
What I would suggest is to modify getArguments to build a list of tuples of indices into the matrices that need to be combined. That's only two numbers that have to be sent to the worker process, instead of two rows of a matrix! Each worker then knows which rows in the matrices to use.
Load the two matrices and store them in global variables before calling map. This way every worker has access to them. And as long as they're not modified in the workers, the OS's virtual memory manager will not copy identical memory pages, keeping memory usage down.
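A minimal sketch of that suggestion; the matrices here are small random stand-ins, the similarity is a placeholder for the question's jaccard_similarity_score, and copy-free inheritance of the globals assumes the fork start method (the default on Linux). Only index pairs travel through the pool:

import itertools
from multiprocessing import Pool
import numpy as np

# Store the matrices in module-level globals before the pool is created, so
# forked workers inherit them instead of receiving two full rows per pair.
testMatrix1 = np.random.randint(0, 2, (100, 50))   # small shapes for illustration
testMatrix2 = np.random.randint(0, 2, (100, 50))

def similarity_by_index(idx_pair):
    i, j = idx_pair                     # only two integers are sent to the worker
    # Placeholder similarity; the question uses sklearn's jaccard_similarity_score here.
    return float(np.mean(testMatrix1[i] == testMatrix2[j]))

if __name__ == "__main__":
    index_pairs = itertools.product(range(testMatrix1.shape[0]), range(testMatrix2.shape[0]))
    with Pool(8) as pool:
        for result in pool.imap_unordered(similarity_by_index, index_pairs, chunksize=1000):
            pass   # stream results to disk here rather than keeping them all in memory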

Reading and graphing data read from huge files

We have pretty large files, on the order of 1-1.5 GB combined (mostly log files), with raw data that is easily parseable to a csv, which is subsequently supposed to be graphed to generate a set of graph images.
Currently, we are using bash scripts to turn the raw data into a csv file with just the numbers that need to be graphed, and then feeding it into a gnuplot script. But this process is extremely slow. I tried to speed up the bash scripts by replacing some piped cuts, trs etc. with a single awk command, and although this improved the speed, the whole thing is still very slow.
So, I am starting to believe there are better tools for this process. I am currently looking to rewrite this process in python+numpy or R. A friend of mine suggested using the JVM, and if I am to do that, I will use clojure, but am not sure how the JVM will perform.
I don't have much experience in dealing with these kinds of problems, so any advice on how to proceed would be great. Thanks.
Edit: Also, I will want to store (to disk) the generated intermediate data, i.e., the csv, so I don't have to re-generate it, should I choose I want a different looking graph.
Edit 2: The raw data files have one record per line, with fields separated by a delimiter (|). Not all fields are numbers. Each field I need in the output csv is obtained by applying a certain formula to the input records, which may use multiple fields from the input data. The output csv will have 3-4 fields per line, and I need graphs that plot fields 1-2, 1-3, and 1-4 in (maybe) a bar chart. I hope that gives a better picture.
Edit 3: I have modified @adirau's script a little and it seems to be working pretty well. I have come far enough that I am reading data, sending it to a pool of processor threads (pseudo-processing: appending the thread name to the data), and aggregating it into an output file through another collector thread.
PS: I am not sure about the tagging of this question, feel free to correct it.
Python sounds like a good choice because it has a good threading API (though the implementation is questionable), plus matplotlib and pylab. I'm missing some more specs from your end, but maybe this could be a good starting point for you: matplotlib: async plotting with threads.
I would go for a single thread handling the bulk disk I/O reads and synchronized queueing to a pool of threads for data processing (if you have fixed record lengths, things may get faster by precomputing read offsets and passing just the offsets to the thread pool). In the disk-I/O thread I would mmap the data source files and read a predefined number of bytes plus one more read to grab the remaining bytes up to the end of the current input line; the number of bytes should be chosen near your average line length. Next comes feeding the pool via the queue, and the data processing/plotting that takes place in the thread pool. I don't have a good picture here (of what exactly you are plotting), but I hope this helps.
EDIT: there's file.readlines([sizehint]) to grab multiple lines at once; it may not be that fast though, since the docs say it uses readline() internally.
EDIT: a quick code skeleton:
import threading
from collections import deque
import sys
import mmap

class processor(threading.Thread):
    """
    processor gets a batch of data at a time from the diskio thread
    """
    def __init__(self, q):
        threading.Thread.__init__(self, name="plotter")
        self._queue = q

    def run(self):
        # get batched data
        while True:
            # we wait for a batch
            dataloop = self.feed(self._queue.get())
            try:
                while True:
                    self.plot(next(dataloop))
            except StopIteration:
                pass
            # sanitizer exceptions following, maybe

    def parseline(self, line):
        """ return a data struct ready for plotting """
        raise NotImplementedError

    def feed(self, databuf):
        # we yield one-at-a-time data structs ready to go for plotting
        for line in databuf:
            yield self.parseline(line)

    def plot(self, data):
        """integrate
        https://www.esclab.tw/wiki/index.php/Matplotlib#Asynchronous_plotting_with_threads
        maybe
        """

class sharedq(object):
    """i dont recall where i got this implementation from
    you may write a better one"""
    def __init__(self, maxsize=8192):
        self.queue = deque()
        self.barrier = threading.RLock()
        self.read_c = threading.Condition(self.barrier)
        self.write_c = threading.Condition(self.barrier)
        self.msz = maxsize

    def put(self, item):
        self.barrier.acquire()
        while len(self.queue) >= self.msz:
            self.write_c.wait()
        self.queue.append(item)
        self.read_c.notify()
        self.barrier.release()

    def get(self):
        self.barrier.acquire()
        while not self.queue:
            self.read_c.wait()
        item = self.queue.popleft()
        self.write_c.notify()
        self.barrier.release()
        return item

q = sharedq()
# number of lines to batch together for one queue item
batchsize = 64

for i in range(8):
    p = processor(q)
    p.daemon = True   # skeleton only; real code needs a proper shutdown protocol
    p.start()

for fn in sys.argv[1:]:
    with open(fn, "r+b") as f:
        mapped = mmap.mmap(f.fileno(), 0)
        # read the mapped file line by line and hand batches to the queue
        batch = []
        line = mapped.readline()
        while line:
            batch.append(line)
            if len(batch) >= batchsize:
                q.put(batch)
                batch = []
            line = mapped.readline()
        if batch:
            q.put(batch)

# some cleanup code may be desirable
I think python+Numpy would be the most efficient way, regarding speed and ease of implementation.
Numpy is highly optimized, so the performance is decent, and Python would make the algorithm implementation part easier.
This combo should work well for your case, provided you optimize the loading of the file into memory: try to find a middle ground where each processed data block isn't too large, but is large enough to minimize read and write cycles, because that is what will slow the program down.
If you feel this needs more speeding up (which I sincerely doubt), you could use Cython to speed up the sluggish parts.
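A minimal sketch of the Python route under the assumptions in Edit 2 (one '|'-delimited record per line, a derived field computed from a few input fields); the field indices and the formula below are made up purely for illustration. The log is streamed line by line so memory stays bounded, and the intermediate csv is persisted so different graphs can be drawn later without re-parsing:

import csv

def derive(fields):
    # Hypothetical formula combining a few input fields; replace with the real one.
    return float(fields[3]) / (float(fields[5]) + 1.0)

def raw_to_csv(raw_path, csv_path, chunk_lines=100000):
    # Stream the raw log and write the derived fields out in chunks.
    with open(raw_path) as raw, open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        rows = []
        for line in raw:
            fields = line.rstrip("\n").split("|")
            rows.append((fields[0], derive(fields)))
            if len(rows) >= chunk_lines:
                writer.writerows(rows)
                rows = []
        writer.writerows(rows)

# Usage (hypothetical file names):
# raw_to_csv("app.log", "plotdata.csv")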
