How to effectively release memory in Python?

I have a large number of data files, and the data loaded from each file is resampled hundreds of times and processed by several methods. I use numpy for this. What I'm facing is a memory error after the program has been running for several hours. Since each dataset is processed separately and the results are stored in a .mat file using scipy.savemat, I assumed the memory used by the previous dataset could be released, so I used del variable_name followed by gc.collect(), but this does not work. I then used the multiprocessing module, as suggested in this post and this post, but it still doesn't work.
Here is my main code:
import scipy.io as scio
import gc
from multiprocessing import Pool
def dataprocess_session():
    i = -1
    for f in file_lists:
        i += 1
        data = scio.loadmat(f)
        ixs = data['rm_ix']  # resample indices
        del data
        gc.collect()
        data = scio.loadmat('xd%d.mat' % i)  # this is the data; the indices in "ixs" are used to resample subdata from it
        j = -1
        mvs_ls_org = {}  # preallocate result containers as dictionaries, as required by scipy.savemat
        mvs_ls_norm = {}
        mvs_ls_auto = {}
        for ix in ixs:
            j += 1
            key = 'name%d' % j
            X = resample_from_data(data, ix)
            mvs_ls_org[key] = process(X)
        scio.savemat('d%d_ls_org.mat' % i, mvs_ls_org)
        del mvs_ls_org
        gc.collect()
        j = -1
        for ix in ixs:
            j += 1
            key = 'name%d' % j
            X = resample_from_data(data, ix)
            X2 = scale(X.copy(), 'norm')
            mvs_ls_norm[key] = process(X2)
        scio.savemat('d%d_ls_norm.mat' % i, mvs_ls_norm)
        del mvs_ls_norm
        gc.collect()
        j = -1
        for ix in ixs:
            j += 1
            key = 'name%d' % j
            X = resample_from_data(data, ix)
            X2 = scale(X.copy(), 'auto')
            mvs_ls_auto[key] = process(X2)
        scio.savemat('d%d_ls_auto.mat' % i, mvs_ls_auto)
        del mvs_ls_auto
        gc.collect()
        # use another method to process the data
        j = -1
        mvs_fcm_org = {}  # also preallocate containers for storing results
        mvs_fcm_norm = {}
        mvs_fcm_auto = {}
        for ix in ixs:
            j += 1
            key = 'name%d' % j
            X = resample_from_data(data['X'].copy(), ix)
            dp, _ = process_2(X.copy())
            mvs_fcm_org[key] = dp
        scio.savemat('d%d_fcm_org.mat' % i, mvs_fcm_org)
        del mvs_fcm_org
        gc.collect()
        j = -1
        for ix in ixs:
            j += 1
            key = 'name%d' % j
            X = resample_from_data(data['X'].copy(), ix)
            X2 = scale(X.copy(), 'norm')
            dp, _ = process_2(X2.copy())
            mvs_fcm_norm[key] = dp
        scio.savemat('d%d_fcm_norm.mat' % i, mvs_fcm_norm)
        del mvs_fcm_norm
        gc.collect()
        j = -1
        for ix in ixs:
            j += 1
            key = 'name%d' % j
            X = resample_from_data(data['X'].copy(), ix)
            X2 = scale(X.copy(), 'auto')
            dp, _ = process_2(X2.copy())
            mvs_fcm_auto[key] = dp
        scio.savemat('d%d_fcm_auto.mat' % i, mvs_fcm_auto)
        del mvs_fcm_auto
        gc.collect()
This is how I initially did it: I split file_lists into 7 parts and ran 7 Python screens, as my computer has 8 CPU cores. There is no problem in MATLAB if I do it this way. I do not combine the iterations over ixs for the different processing methods, because the memory error can occur, so I run resample_from_data and save the results separately. As the memory error persisted, I also tried the Pool class:
pool = Pool(processes=7)
pool.map(dataprocess_session_2, file_lists)
which parallelizes the iteration over file_lists, with the file names in file_lists as inputs.
All code is run on openSUSE with Python 2.7.5, an 8-core CPU and 32 GB of RAM. I used top to monitor memory usage. None of the matrices is particularly large, and running the full code on any single data file works fine. But after several iterations over file_lists, the free memory drops dramatically. I'm sure this is not caused by the data itself, since even the largest data matrix should not use that much memory while it is being processed. So I suspect that the approaches above, which I used to release the memory taken by processing the previous data and by storing the results, did not really release it.
Any suggestions?

All the variables you del explicitly are released automatically as soon as the loop ends, so I don't think they are your problem. I think it's more likely that your machine simply can't handle 7 workers with (in the worst case) 7 simultaneous executions of data = scio.loadmat(f). You could try marking that call as a critical section with locks.
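Here is a minimal sketch of what that could look like; it is an assumption about how your Pool is set up, not code from your post. The lock is handed to every worker through the pool initializer, and dataprocess_session_2 and file_lists stand in for your own function and list:
import scipy.io as scio
from multiprocessing import Pool, Lock

lock = None

def init_worker(l):
    # give every worker process a reference to the shared lock
    global lock
    lock = l

def dataprocess_session_2(f):
    # critical section: only one worker loads a .mat file at a time
    with lock:
        data = scio.loadmat(f)
    # ... resample and process `data`, then save with scio.savemat as before ...

if __name__ == '__main__':
    l = Lock()
    pool = Pool(processes=7, initializer=init_worker, initargs=(l,))
    pool.map(dataprocess_session_2, file_lists)  # file_lists as defined in the question
    pool.close()
    pool.join()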

This may be of some help: gc.collect()

Related

Can't use Python multiprocessing with large amount of calculations

I have to speed up my current code to do around 10^6 operations in a feasible time. Before using multiprocessing in the actual code, I tried it on a mock case. The following is my attempt:
import concurrent.futures
import time
import numpy as np

def chunkIt(seq, num):
    avg = len(seq) / float(num)
    out = []
    last = 0.0
    while last < len(seq):
        out.append(seq[int(last):int(last + avg)])
        last += avg
    return out

def do_something(List):
    # in the real case this function takes about 0.5 seconds to finish for each iteration
    turn = []
    for e in List:
        turn.append((e[0]**2, e[1]**2, e[2]**2))
    return turn

t1 = time.time()
List = []
# in the real case these 20's can go as high as 150
for i in range(1, 20-2):
    for k in range(i+1, 20-1):
        for j in range(k+1, 20):
            List.append((i, k, j))
t3 = time.time()
test = []
List = chunkIt(List, 3)
if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = executor.map(do_something, List)
        for result in results:
            test.append(result)
    test = np.array(test)
    t2 = time.time()
    T = t2 - t1
    T2 = t3 - t1
However, when I increase the size of my "List", my computer tries to use all of my RAM and CPU and freezes. I even cut my "List" into 3 pieces so it would only use 3 of my cores, but nothing changed. Also, when I tried it on a smaller data set, I noticed the code ran much slower than on a single core.
I am still very new to multiprocessing in Python. Am I doing something wrong? I would appreciate it if you could help me.
To reduce memory usage, I suggest you instead use the multiprocessing module, specifically the imap method (or imap_unordered). Unlike the map method of either multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor, the iterable argument is processed lazily. This means that if you use a generator function or generator expression for the iterable argument, you do not need to create the complete list of arguments in memory; as processes in the pool become free and ready to execute more tasks, the generator is called upon to produce the next argument for the imap call.
By default a chunksize value of 1 is used, which can be inefficient for a large iterable. When using map with the default chunksize of None, the pool looks at the length of the iterable, first converting it to a list if necessary, and then computes what it deems to be an efficient chunksize based on that length and the size of the pool. When using imap or imap_unordered, converting the iterable to a list would defeat the whole purpose of using that method. But if you know (more or less) what its size would be if it were converted to a list, then there is no reason not to apply the same chunksize calculation the map method would have used, and that is what is done below.
The following benchmarks perform the same processing, first as a single process and then with multiprocessing using imap, where each invocation of do_something takes approximately 0.5 seconds on my desktop. do_something has now been modified to process just a single i, k, j tuple, as there is no longer any need to break anything up into smaller lists:
from multiprocessing import Pool, cpu_count
import time

def half_second():
    HALF_SECOND_ITERATIONS = 10_000_000
    sum = 0
    for _ in range(HALF_SECOND_ITERATIONS):
        sum += 1
    return sum

def do_something(tpl):
    # in the real case this function takes about 0.5 seconds to finish for each iteration
    half_second()  # on my desktop
    return tpl[0]**2, tpl[1]**2, tpl[2]**2

"""
def generate_tpls():
    for i in range(1, 20-2):
        for k in range(i+1, 20-1):
            for j in range(k+1, 20):
                yield i, k, j
"""

# Use a smaller number of tuples so we finish in a reasonable amount of time:
def generate_tpls():
    # 64 tuples:
    for i in range(1, 5):
        for k in range(1, 5):
            for j in range(1, 5):
                yield i, k, j

def benchmark1():
    """ single processing """
    t = time.time()
    for tpl in generate_tpls():
        result = do_something(tpl)
    print('benchmark1 time:', time.time() - t)

def compute_chunksize(iterable_size, pool_size):
    """ This is more or less the calculation used by the Pool.map method """
    chunksize, remainder = divmod(iterable_size, 4 * pool_size)
    if remainder:
        chunksize += 1
    return chunksize

def benchmark2():
    """ multiprocessing """
    t = time.time()
    pool_size = cpu_count()  # 8 logical cores (4 physical cores)
    N_TUPLES = 64  # number of tuples that will be generated
    pool = Pool(pool_size)
    chunksize = compute_chunksize(N_TUPLES, pool_size)
    for result in pool.imap(do_something, generate_tpls(), chunksize=chunksize):
        pass
    print('benchmark2 time:', time.time() - t)

if __name__ == '__main__':
    benchmark1()
    benchmark2()
Prints:
benchmark1 time: 32.261038303375244
benchmark2 time: 8.174998044967651
The nested for loops creating the list before the if __name__ == '__main__': block appear to be the problem. Moving that part underneath it clears up the memory problems.
import concurrent.futures
import time
import numpy as np

def chunkIt(seq, num):
    avg = len(seq) / float(num)
    out = []
    last = 0.0
    while last < len(seq):
        out.append(seq[int(last):int(last + avg)])
        last += avg
    return out

def do_something(List):
    # in the real case this function takes about 0.5 seconds to finish for each iteration
    turn = []
    for e in List:
        turn.append((e[0]**2, e[1]**2, e[2]**2))
    return turn

if __name__ == '__main__':
    t1 = time.time()
    List = []
    # in the real case these 20's can go as high as 150
    for i in range(1, 20-2):
        for k in range(i+1, 20-1):
            for j in range(k+1, 20):
                List.append((i, k, j))
    t3 = time.time()
    test = []
    List = chunkIt(List, 3)
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = executor.map(do_something, List)
        for result in results:
            test.append(result)
    test = np.array(test)
    t2 = time.time()
    T = t2 - t1
    T2 = t3 - t1

Is this a case of shared memory?

I'm trying to optimize the following loop:
telePortMatrix = sparse.coo_matrix(self.matrix, dtype=float)
telePortMatrixCSR = telePortMatrix.tocsr()
for r in telePortMatrix.row:
    rowCount = telePortMatrix.getrow(r).nnz
    rowSum = telePortMatrix.getrow(r).sum()
    avg = rowSum / rowCount
    for c in telePortMatrix.col:
        if telePortMatrixCSR[r, c] != 0:
            telePortMatrixCSR[r, c] = avg
with the joblib module:
telePortMatrix = sparse.coo_matrix(self.matrix, dtype=float)
telePortMatrixCSR = telePortMatrix.tocsr()

def loop(r):
    rowCount = telePortMatrix.getrow(r).nnz
    rowSum = telePortMatrix.getrow(r).sum()
    avg = rowSum / rowCount
    for c in telePortMatrix.col:
        if telePortMatrixCSR[r, c] != 0:
            telePortMatrixCSR[r, c] = avg

Parallel(n_jobs=-1)(delayed(loop)(i) for i in telePortMatrix.row)
But according to the Shared-memory semantics section of the documentation, if the parallel function really needs to rely on the shared-memory semantics of threads, it should be made explicit with require='sharedmem'. I'm not quite sure whether this is a case of shared memory, since we're using the same matrix but no single function call will ever access the same row twice.
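For reference, a minimal sketch (my reading of what the documentation describes, not a confirmed fix) of the same call with the shared-memory requirement made explicit; loop and telePortMatrix are the objects defined above:
from joblib import Parallel, delayed

# require='sharedmem' forces a thread-based backend so that all workers
# see and update the same telePortMatrixCSR object in place
Parallel(n_jobs=-1, require='sharedmem')(
    delayed(loop)(r) for r in telePortMatrix.row
)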

Using Multithreading or Multiprocessing to improve computational speed

I am iterating through a very large mesh. Since the iterations are independent, I would like to split my mesh into smaller pieces and run them all at the same time in order to lower computation time. Below is a sample code. For example, if mesh has length 50000, I would like to divide it into 100 pieces and run fnc on each piece at the same time.
import numpy as np

def fnc(data, mesh):
    d = []
    for i, dummy_val in enumerate(mesh):
        d.append(np.sqrt((data[:, 0] - mesh[i, 0])**2.0 + (data[:, 1] - mesh[i, 1])**2.0))
    return d

interpolate = fnc(mydata, mymesh)
I would like to know how to achieve this using multiprocessing or multithreading, as I'm unable to reconcile it with the execution of my loop.
This will give you the general idea. I couldn't test it since I do not have your data. The default constructor for ProcessPoolExecutor will use the number of processors on your computer. Since that determines the level of multiprocessing you can have, it will probably be more efficient to set the N_CHUNKS parameter to the number of simultaneous processes you can support. That is, if your processing pool has size 6, it is better to divide your array into 6 large chunks and have 6 processes do the work, rather than breaking it up into smaller pieces that processes will have to wait to run. So you should probably pass a max_workers value to ProcessPoolExecutor that is not greater than the number of processors you have, and set N_CHUNKS to the same value.
from concurrent.futures import ProcessPoolExecutor, as_completed
import numpy as np

def fnc(data, mesh):
    d = []
    for i, dummy_val in enumerate(mesh):
        d.append(np.sqrt((data[:, 0] - mesh[i, 0])**2.0 + (data[:, 1] - mesh[i, 1])**2.0))
    return d

def main(data, mesh):
    #N_CHUNKS = 100
    N_CHUNKS = 6  # assuming you have 6 processors; see max_workers parameter
    n = len(mesh)
    assert n != 0
    if n <= N_CHUNKS:
        N_CHUNKS = 1
        chunk_size = n
        last_chunk_size = n
    else:
        chunk_size = n // N_CHUNKS
        last_chunk_size = n - chunk_size * (N_CHUNKS - 1)
    with ProcessPoolExecutor(max_workers=N_CHUNKS) as executor:  # assuming you have 6 processors
        the_futures = {}
        start = 0
        for i in range(N_CHUNKS - 1):
            future = executor.submit(fnc, data, mesh[start:start + chunk_size])  # pass slice
            the_futures[future] = (start, start + chunk_size)  # map future to request parameters
            start += chunk_size
        if last_chunk_size:
            future = executor.submit(fnc, data, mesh[start:n])  # pass slice
            the_futures[future] = (start, n)
        for future in as_completed(the_futures):
            (start, end) = the_futures[future]  # the original range
            d = future.result()  # do something with the results

if __name__ == '__main__':
    # The call to main must be done in a block governed by if __name__ == '__main__',
    # or you will get into a recursive loop where each subprocess calls main again.
    main(data, mesh)

Multiprocessing Pool Not Mapping

I am attempting to create Pool() objects so that I can break down large arrays. However, each time I run through the code below after the first pass, the map never runs: only the first pass seems to enter the function, even though the arguments are the same size, and in fact the EXACT same arguments are used. Only the first
job.map(...)
appears to run. Below is the source of my pain (not all the code in the file):
def iterCount():
    # m is not in shared memory, as intended.
    global m
    m = m + 1
    return m

def thread_search(pair):
    divisor_lower = pair[0]
    divisor_upper = pair[1]
    for i in range(divisor_lower, divisor_upper, window_size):
        current_section = np.array(x[i: i + window_size])
        for row in current_section:
            if row[2].startswith('NP') and checkPep(row[0]):  # checkPep is a simple unique-in-array checking function
                # shared_list is a multiprocessing.Manager list.
                shared_list.append(row[[0, 1, 2]])
            m = iterCount()
            if not m % 1000000:
                print(f'Encountered m = {m}', flush=True)

def poolMap(pairs, group):
    job = Pool(3)
    print(f'Pool Created')
    print(len(pairs))
    job.map(thread_search, pairs)
    print('Pool Closed')
    job.close()

if __name__ == '__main__':
    for group in [1, 2, 3]:  # Example times to be run...
        x = None
        lower_bound = int((group - 1) * group_step)
        upper_bound = int(group * group_step)
        x = list(csv.reader(open(pa_table_name, "rt", encoding="utf-8"), delimiter="\t"))[lower_bound:upper_bound]
        print(len(x))
        divisor_pairs = [[int(lower_bound + (i - 1) * chunk_size), int(lower_bound + i * chunk_size)] for i in range(1, 6143)]
        poolMap(divisor_pairs, group)
The output of this function is:
Program started: 03/09/19, 12:41:25 (Machine Time)
11008256 - Length of the file read in (in the group)
Pool Created
6142 - len(pairs)
Encountered m = 1000000
Encountered m = 1000000
Encountered m = 1000000
Encountered m = 2000000
Encountered m = 2000000
Encountered m = 2000000
Encountered m = 3000000
Encountered m = 3000000
Encountered m = 3000000 (Total size is ~ 9 million per set)
Pool Closed
11008256 (this is the number of lines read, correct value)
Pool Created
6142 (Number of pairs is correct, though map appears to never run...)
Pool Closed
11008256
Pool Created
6142
Pool Closed
At this point, shared_list is saved, and only the first thread's results appear to be present.
I'm really at a loss as to what is happening here, and I've tried to find bugs or similar reported instances of this behavior.
Ubuntu 18.04
Python 3.6

Optimizing a for loop by using vectors in Python

I created the following code:
import numpy as np
import scipy as sp

M = 20000
sample_all = np.load('sample.npy')
sd = np.zeros(M)
chi_arr = np.zeros((M, 4))
sigma_e = np.zeros((M, 41632))
mean_sigma = np.zeros(M)
max_sigma = np.zeros(M)
min_sigma = np.zeros(M)
z = np.load('z_array.npy')
prof = np.load('profile_at_sources.npy')
L = np.load('luminosities.npy')
for k in range(M):
    sd[k] = np.array(sp.std(sample_all[k, :]))
    arr = np.genfromtxt('samples_fin1.txt').T[2:6]
    arr_T = arr.T
    chi_arr[k, :] = arr_T[k, :]
    sigma_e[k, :] = np.sqrt(calc(z, prof, chi_arr[k, :], L))
    mean_sigma[k] = np.array(sp.mean(sigma_e[k, :]))
    max_sigma[k] = np.array(sigma_e[k, :].max())
    min_sigma[k] = np.array(sigma_e[k, :].min())
where calc(...) is a function that calculates some quantities (its details are not important for my question).
For M=20000 this loop takes about 27 hours on my machine, which is too long. Is there a way to optimize it, maybe with vectors instead of a for loop?
For me it's really easy to write a loop; my head thinks in loops for this kind of code, and that's my limitation. Could you help me? Thanks.
It seems to me that each of the k-th rows created in your various arrays is independent of the other iterations of your for loop and depends only on the corresponding row of sigma_e, so you could parallelize it over many workers. I'm not sure the code is 100% kosher, but you didn't provide a working example.
Note this only works if each k-th iteration is COMPLETELY independent of the (k-1)-th iteration.
import threading
import numpy as np
import scipy as sp

M = 20000
sample_all = np.load('sample.npy')
sd = np.zeros(M)
chi_arr = np.zeros((M, 4))
sigma_e = np.zeros((M, 41632))
mean_sigma = np.zeros(M)
max_sigma = np.zeros(M)
min_sigma = np.zeros(M)
z = np.load('z_array.npy')
prof = np.load('profile_at_sources.npy')
L = np.load('luminosities.npy')
workers = 100

arr = np.genfromtxt('samples_fin1.txt').T[2:6]  # only works if this is really what you're doing to set arr

def worker(k_start, k_end):
    for k in range(k_start, k_end + 1):
        sd[k] = np.array(sp.std(sample_all[k, :]))
        arr_T = arr.T
        chi_arr[k, :] = arr_T[k, :]
        sigma_e[k, :] = np.sqrt(calc(z, prof, chi_arr[k, :], L))
        mean_sigma[k] = np.array(sp.mean(sigma_e[k, :]))
        max_sigma[k] = np.array(sigma_e[k, :].max())
        min_sigma[k] = np.array(sigma_e[k, :].min())

threads = []
kstart = 0
for k in range(0, workers):
    T = threading.Thread(target=worker, args=[0 + k * M / workers, (1 + k) * M / workers - 1])
    threads.append(T)
    T.start()
for t in threads:
    t.join()
Edited following comments:
It seems CPython has a global interpreter lock (the GIL) that prevents threads from executing Python bytecode in parallel, so the threads above will not actually run concurrently. Use IronPython or Jython to step around this. Also, you can move the file read outside the loop if you're really just deserializing the same array from samples_fin1.txt.
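Since the question asked specifically about vectors, here is a minimal sketch (not from the thread, and assuming calc really must be applied row by row because its internals are unknown) of the parts of the loop that vectorize directly with numpy:
import numpy as np

arr_T = np.genfromtxt('samples_fin1.txt').T[2:6].T  # read the file once, outside the loop
sd = sample_all[:M].std(axis=1)                     # replaces the per-k sp.std calls
chi_arr = arr_T[:M, :]                              # replaces the per-k row copies
for k in range(M):                                  # calc() still applied row by row
    sigma_e[k, :] = np.sqrt(calc(z, prof, chi_arr[k, :], L))
mean_sigma = sigma_e.mean(axis=1)                   # replaces the per-k sp.mean calls
max_sigma = sigma_e.max(axis=1)
min_sigma = sigma_e.min(axis=1)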
