Ipyparallel slow execution with scatter/gather - python

Context: I have an array that I have scattered across my engines (4 engines at this time). I want to apply a function to each point in the array for an arbitrary number of iterations, gather the resulting array from the engines, and perform analysis on it.
For example, I have an array of data points that are scattered, and the number of iterations to run on each data point:
data_points = range(16)
iterations = 10
dview.scatter('points', data_points)
I have a user-supplied function like this, which is pushed to the engines:
def user_supplied_function(point):
    return randint(0, point)

dview.push(dict(function_one = user_supplied_function))
A list for my results and the parallel execution:
result_list = []
for i in range(iterations):
    %px engine_result = [function_one(j) for j in points]
    result_list.append(dview.gather('engine_result'))
Issue: This works, and I get the result I want from the engines. However, as the number of iterations grows, the loop takes longer and longer to execute, to the point where 1000 iterations on 50 points take upwards of 15 seconds to complete, whereas a sequential version of this task takes less than a second.
Any idea what could be causing this? Could it be the overhead of the message passing from gather()? If so, can anyone suggest any solutions?

Figured it out. It was the overhead from gather() and .append() after all. The easiest fix is to gather() after the engines have finished their work, as opposed to doing it each iteration.
Solution
%autopx
engine_result = []
for i in xrange(iterations):
    engine_result += [[function_one(j) for j in points]]
%autopx
result_list = list(dview.gather('engine_result'))
This, however, returns the results as an awkwardly ordered list of lists, where the results from each engine are placed next to each other instead of being ordered by iteration number. The following commands regroup the sub-lists by iteration and flatten them:
gathered_list = [None] * iterations
gathered_list = [[result_list[j * iterations + i]
                  for j in xrange(len(result_list) / iterations)]
                 for i in xrange(iterations)]
gathered_list = [reduce(lambda x, y: x.extend(y) or x, z) for z in gathered_list]
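The same regrouping can also be written with itertools.chain, which avoids the reduce/extend trick. This is an equivalent sketch under the same assumption as above, i.e. that result_list holds n_engines * iterations sub-lists grouped by engine first:
from itertools import chain

n_engines = len(result_list) // iterations
gathered_list = [
    list(chain.from_iterable(result_list[j * iterations + i] for j in xrange(n_engines)))
    for i in xrange(iterations)
]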

Related

Any way to salvage this code snippet to avoid memory bottlenecks?

The snippet below was obtained from https://github.com/suanrong/SDNE, in /utils/utils.py.
The issue I am having is that this get_precisionK function calls the getSimilarity function, which computes the product of an NxM matrix and its transpose. The product is symmetric and NxN, so we could decrease memory usage by almost half. But the bigger issue here is that N is on the order of 10**5. In the particular problem I am working with, N is about 600k.
A 600k x 600k matrix of doubles requires about 3 TB of memory, which I do not have. So the code is crashing (because of memory limitations) when trying to compute the np.dot. I am trying to restructure the code without exceeding memory limitations, but the way the function uses the product makes it a little difficult.
The only thing I could come up with is feeding a subset of the rows of the embedding matrix to the getSimilarity function, computing the product, then sorting it and writing it to a file (EDIT: actually, instead of sorting and writing to a file, I think I would just write the values to the file and sort them later on). We'd end up with multiple files of sorted indices, and we'd have to merge them all into one giant file somehow. The merging part is another challenge, I think.
The for loop iterates in descending order. I would then need to read in consecutive small portions (whatever memory allows) of the aforementioned giant file and do the operations.
This sounds quite complicated and I wanted to see if any others have better ideas.
def getSimilarity(result):
    print "getting similarity..."
    return np.dot(result, result.T)

# Note that embedding is an NxM numpy array where N is very large, ~ O(10^5)
def check_reconstruction(embedding, graph_data, check_index):
    def get_precisionK(embedding, data, max_index):
        print "get precisionK..."
        similarity = getSimilarity(embedding).reshape(-1)
        sortedInd = np.argsort(similarity)
        cur = 0
        count = 0
        precisionK = []
        sortedInd = sortedInd[::-1]
        for ind in sortedInd:
            x = ind / data.N
            y = ind % data.N
            count += 1
            if (data.adj_matrix[x].toarray()[0][y] == 1 or x == y):
                cur += 1
            precisionK.append(1.0 * cur / count)
            if count > max_index:
                break
        return precisionK

    precisionK = get_precisionK(embedding, graph_data, np.max(check_index))
    ret = []
    for index in check_index:
        print "precisonK[%d] %.2f" % (index, precisionK[index - 1])
        ret.append(precisionK[index - 1])
    return ret
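One possible direction (my own sketch, not code from the SDNE repo): process the embedding in row blocks so that only a block_size x N slab of the similarity matrix is ever in memory, and keep a running top-K of the largest similarities with a heap. The function name, block_size and the heap bookkeeping are assumptions for illustration; precision@K only needs the K largest entries, so the full sorted NxN array never has to exist:
import heapq
import numpy as np

def topk_similarities_blockwise(embedding, K, block_size=1024):
    """Sketch: (similarity, row, col) for the K largest entries of
    embedding.dot(embedding.T), without materializing the full NxN matrix."""
    N = embedding.shape[0]
    heap = []  # min-heap holding the K best (similarity, row, col) seen so far
    for start in range(0, N, block_size):
        stop = min(start + block_size, N)
        block = np.dot(embedding[start:stop], embedding.T)  # only a block_size x N slab
        flat = block.ravel()
        # pre-select the K largest candidates inside this slab
        if flat.size > K:
            candidate_idx = np.argpartition(flat, -K)[-K:]
        else:
            candidate_idx = np.arange(flat.size)
        for idx in candidate_idx:
            row, col = divmod(int(idx), N)
            item = (float(flat[idx]), start + row, col)
            if len(heap) < K:
                heapq.heappush(heap, item)
            else:
                heapq.heappushpop(heap, item)  # drop the smallest, keep the K largest
    return sorted(heap, reverse=True)
The K entries returned here would play the role of the first K elements of the reversed argsort in get_precisionK, at the cost of recomputing one dot-product slab per block.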

Updating nested list takes too long in python

I am trying to implement the Brown clustering algorithm in Python.
My data structure is clusters = List[List].
At any given time, the outer list's length will be at most 40 or 41.
Each inner list contains English words such as 'the', 'hello', etc.
I have a total of 8000 words (the vocabulary), and initially the first 40 words are put into clusters.
I iterate over my vocabulary from 41 to 8000:
# do some computation; this takes very little time
# merge 2 items in the list and delete one item from the list
# e.g. if c1 and c2 are indices into clusters, then:
for i in range(41, 8000):
    clusters.append(vocabulary[i])
    c1 = computation_1()  # placeholder for the actual computation
    c2 = computation_2()  # placeholder for the actual computation
    clusters[c1] = clusters[c1] + clusters[c2]
    del clusters[c2]
But the time taken by the line clusters[c1] = clusters[c1] + clusters[c2] grows gradually as I iterate over my vocabulary. Initially, for items 41-50, it is 1 sec, but for every 20 items in the vocabulary the time grows by 1 sec.
On commenting out just clusters[c1] = clusters[c1] + clusters[c2] from my code, I observe that all iterations take constant time. I am not sure how I can speed up this process.
for i in range(41, 8000):
    clusters.append(vocabulary[i])
    c1 = computation_1()
    c2 = computation_2()
    #clusters[c1] = clusters[c1] + clusters[c2]
    del clusters[c2]
I am new to Stack Overflow; please excuse any incorrect formatting here.
Thanks
The problem you're running into is that list concatenation is a linear time operation. Thus, your entire loop is O(n^2) (and that's prohibitively slow for n much larger than 1000). This is ignoring how copying such large lists can be bad for cache performance, etc.
Disjoint Set data structure
The solution I recommend is to use a disjoint-set data structure. This is a tree-based data structure that "self-flattens" as you perform queries, resulting in very fast runtimes for "merging" clusters.
The basic idea is that each word starts off as its own "singleton" tree, and merging clusters consists of making the root of one tree the child of another. This repeats (with some care for balancing) until you have as many clusters as desired.
I've written an example implementation (GitHub link) that assumes elements of each set are numbers. As long as you have a mapping from vocabulary terms to integers, it should work just fine for your purposes. (Note: I've done some preliminary testing, but I wrote it in 5 minutes right now so I'd recommend checking my work. ;) )
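For reference, a minimal sketch of such a structure, with path compression and union by rank, might look like the following. The constructor, join and query names are chosen to match the usage below; they are my assumption, not necessarily the linked implementation's exact API:
class DisjointSet(object):
    """Union-find with path compression and union by rank."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def _find(self, x):
        # Path compression: walk up, pointing nodes closer to the root as we go
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def join(self, a, b):
        # Union by rank: attach the shallower tree under the deeper one
        ra, rb = self._find(a), self._find(b)
        if ra == rb:
            return
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1

    def query(self, a, b):
        # True if a and b are currently in the same set
        return self._find(a) == self._find(b)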
To use in your code, I would do something like the following:
clusters = DisjointSet(8000)
# some code to merge the first 40 words into clusters
for i in range(41, 8000):
    c1 = some_computation()  # assuming c1 is a number
    c2 = some_computation()  # assuming c2 is a number
    clusters.join(c1, c2)

# Now, if you want to determine if some word with number k is
# in the same cluster as a word with number j:
print("{} and {} are in the same cluster? {}".format(j, k, clusters.query(j, k)))
Regarding Sets vs Lists
While sets provide faster membership tests than lists, they actually have worse runtime when copying. This makes sense in theory, because a set object has to allocate and assign more memory than a list to maintain an appropriate load factor. Also, inserting so many items can trigger a "rehash" of the entire hash table, which is a quadratic-time operation in the worst case.
However, practice is what we're concerned with here, so I ran a quick experiment to determine exactly how much worse off sets are than lists.
Code for performing this test, in case anyone was interested, is below. I'm using the Intel packaging of Python, so my performance may be slightly faster than on your machine.
import time
import random
import numpy as np
import matplotlib.pyplot as plt

data = []
for trial in range(5):
    trial_data = []
    for N in range(0, 20000, 50):
        l1 = random.sample(range(1000000), N)
        l2 = random.sample(range(1000000), N)
        s1 = set(l1)
        s2 = set(l2)
        # Time to concatenate two lists of length N
        start_lst = time.clock()
        l3 = l1 + l2
        stop_lst = time.clock()
        # Time to union two sets of length N
        start_set = time.clock()
        s3 = s1 | s2
        stop_set = time.clock()
        trial_data.append([N, stop_lst - start_lst, stop_set - start_set])
    data.append(trial_data)

# average the trials and plot
data_array = np.array(data)
avg_data = np.average(data_array, 0)
fig = plt.figure()
ax = plt.gca()
ax.plot(avg_data[:,0], avg_data[:,1], label='Lists')
ax.plot(avg_data[:,0], avg_data[:,2], label='Sets')
ax.set_xlabel('Length of set or list (N)')
ax.set_ylabel('Seconds to union or concat (s)')
plt.legend(loc=2)
plt.show()

How to make huge loops in python faster on mac?

I am a computer science student and some of the things I do require me to run huge loops on a MacBook with a dual-core i5. Some of the loops take 5-6 hours to complete, but they only use 25% of my CPU. Is there a way to make this process faster? I can't change my loops, but is there a way to make them run faster?
Thank you
Mac OS 10.11
Python 2.7 (I have to use 2.7) with IDLE or Spyder on Anaconda
Here is a sample code that takes 15 minutes:
def test_false_pos():
    sumA = [0] * 1000
    for test in range(1000):
        counter = 0
        bf = BloomFilter(4095,10)
        for i in range(600):
            bf.rand_inserts()
        for x in range(10000):
            randS = str(rnd.randint(0,10**8))
            if bf.lookup(randS):
                counter += 1
        sumA[test] = counter/10000.0
    avg = np.mean(sumA)
    return avg
Sure thing: Python 2.7 has to generate huge lists and waste a lot of memory each time you use range(<a huge number>).
Try to use the xrange function instead. It doesn't create that gigantic list at once, it produces the members of a sequence lazily.
But if you were to use Python 3 (which is the modern version and the future of Python), you'll find that its range is even cooler and faster than xrange in Python 2.
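For illustration, a rough sketch of the swap (the loop body is just a stand-in for your own work):
# Python 2: xrange yields numbers lazily instead of building a huge list up front
total = 0
for x in xrange(10**7):
    total += x
print total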
You could split it up into 4 loops:
import multiprocessing
import numpy as np
# BloomFilter and rnd are assumed to come from the question's code

def test_false_pos(times, i, q):
    sumA = [0] * times
    for test in range(times):
        counter = 0
        bf = BloomFilter(4095,10)
        for _ in range(600):  # renamed so the worker index i is not shadowed
            bf.rand_inserts()
        for x in range(10000):
            randS = str(rnd.randint(0,10**8))
            if bf.lookup(randS):
                counter += 1
        sumA[test] = counter/10000.0
    q.put([i, list(sumA)])

def full_test(pieces):
    threads = []
    q = multiprocessing.Queue()
    steps = 1000 / pieces
    for i in range(pieces):
        threads.append(multiprocessing.Process(target=test_false_pos, args=(steps, i, q)))
    [thread.start() for thread in threads]
    results = [None] * pieces
    for i in range(pieces):
        i, result = q.get()
        results[i] = result
    # Flatten the array (`results` looks like this: [[...], [...], [...], [...]])
    # source: https://stackoverflow.com/a/952952/5244995
    sums = [val for result in results for val in result]
    return np.mean(np.array(sums))

if __name__ == '__main__':
    full_test(multiprocessing.cpu_count())
This will run n processes that each do 1/nth of the work, where n is the number of processors on your computer.
The test_false_pos function has been modified to take three parameters:
times is the number of times to run the loop.
i is passed through to the result.
q is a queue to add the results to.
The function loops times times, then places i and sumA into the queue for further processing.
The main process (full_test) waits for each worker to finish, then places the results in the appropriate position in the results list. Once the list is complete, it is flattened, and the mean is calculated and returned.
Consider looking into Numba and its JIT (just-in-time compiler). It works for functions that are NumPy based. It can handle some Python routines, but is mainly for speeding up numerical calculations, especially ones with loops (like doing Cholesky rank-1 up/downdates). I don't think it would work with a BloomFilter, but it is generally super helpful to know about.
In cases where you must use other packages in your flow alongside NumPy, separate out the heavy-lifting NumPy routines into their own functions, and put a @jit decorator on top of each of those functions. Then put them into your flow with normal Python stuff.
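A minimal sketch of that pattern is below. It assumes Numba is installed; the pairwise-distance function is purely illustrative and not part of the question's code:
import numpy as np
from numba import jit

@jit(nopython=True)
def pairwise_dist_sum(x):
    # Plain Python loops, compiled to machine code by Numba on the first call
    n = x.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = 0.0
            for k in range(x.shape[1]):
                diff = x[i, k] - x[j, k]
                d += diff * diff
            total += d ** 0.5
    return total

points = np.random.rand(500, 3)
print(pairwise_dist_sum(points))  # first call compiles; later calls run at C-like speed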

Simulating multiple Poisson processes

I have N processes and a different Poisson rate for each. I would like to simulate arrival times from all N processes. If N = 1 I can do this:
t = 0
N = 1
for i in range(1, 10):
    t += random.expovariate(15)
    print N, t
However if I have N = 5 and a list of rates
rates = [10,1,15,4,2]
I would like the loop to somehow output the arrival times of all N processes in the right order. That is, I would still like only two numbers per line (the ID of the process and the arrival time), but globally sorted by arrival time.
I could just make N lists and merge them afterwards but I would like the arrival times to be outputted in the right order in the first place.
Update. One problem is that if you just sample a fixed number of arrivals from each process, you get only early times from the high rate processes. So I think I need to sample from a fixed time interval for each process so the number of samples varies depending on the rate.
If I'm understanding you correctly:
import random
import itertools

def arrivalGenerator(rate):
    t = 0
    while True:
        t += random.expovariate(rate)
        yield t

rates = [10, 1, 15, 4, 2]
t = [(i, 0) for i in range(0, len(rates))]
arrivals = []
for i in range(len(rates)):
    t = 0
    generator = arrivalGenerator(rates[i])
    arrivals += [(i, arrival)
                 for arrival in itertools.takewhile(lambda t: t < 100, generator)]

sorted_arrivals = sorted(arrivals, key=lambda x: x[1])
for arrival in sorted_arrivals:
    print arrival[0], arrival[1]
Note that your initial logic was generating a fixed number of arrivals for each process. What you really want is a specific time window, and to keep generating for a given process until you exceed that time window.
Following http://www.columbia.edu/~ks20/4703-Sigman/4703-07-Notes-PP-NSPP.pdf I think there is a more efficient answer.
You do roughly:
total_rate = sum(rates)
probabilities = [float(r) / total_rate for r in rates]
arrivals = []
t = 0
while t < T:  # T is the length of the time window you want to simulate
    t += random.expovariate(total_rate)
    i = weighted_random(probabilities)
    arrivals.append((i, t))
This method eliminates the need to keep coroutine state around for a large number of different arrival processes. There's just a single "net" arrival process. The distribution will be the same.
Note that I have not given an implementation for weighted_random above, but I assume my intention is clear. It is left as an exercise for the reader ;-) -- or see e.g. http://eli.thegreenplace.net/2010/01/22/weighted-random-generation-in-python.
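For completeness, one possible weighted_random (my own sketch, not the code from the linked article) builds a cumulative table and uses bisect, so each call costs one uniform draw plus a binary search:
import bisect
import random

def weighted_random(probabilities):
    # Build the cumulative distribution; if the rates never change,
    # precompute this table once outside the simulation loop.
    cumulative = []
    total = 0.0
    for p in probabilities:
        total += p
        cumulative.append(total)
    # random() * total falls into exactly one bucket of the cumulative table
    return bisect.bisect_right(cumulative, random.random() * total)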
You can also do:
arrivals = []
t = 0
while t < T:
    # draw a candidate inter-arrival time from every process...
    dt_list = [random.expovariate(r) for r in rates]
    # ...and take the earliest one, remembering which process produced it
    (dt, i) = min((tau, i) for i, tau in enumerate(dt_list))
    t += dt
    arrivals.append((i, t))
i.e., you actually do generate separate interarrival times for all processes, but you do not need to "remember" the states of the processes. Note that the minimum of two independent exponentially-distributed random variables with rates r1 and r2 is itself exponentially distributed with rate r1+r2 (per http://ocw.mit.edu/courses/mathematics/18-440-probability-and-random-variables-spring-2011/lecture-notes/MIT18_440S11_Lecture20.pdf), so this is actually quite similar to the previous code snippet.
Of the two methods I have given here, I think the first is better:
The first is O( len(arrivals) * log(len(rates)) ) whereas the second is O( len(arrivals) * len(rates) )
The first requires 2 random numbers from the underlying generator per arrival, whereas the second requires len(rates) random numbers per arrival.
The first requires 1 evaluation of log (I assume this is how the exponential random variable is generated) per arrival, whereas the second requires O(len(rates)) evaluations of log per arrival.
Also, take all of the above Python syntax with a grain of salt (I have not run it, and I am rusty with Python), and eliminate temporary lists if you like. This is meant as "pseudocode" really; for a fast Monte Carlo simulation you'd probably use C++ (and/or CUDA) anyway.
I know you're probably well past the point of needing this answer, but I hope it can be helpful to others who find this post.

python: improve performance and/or method to avoid memory error creating, saving and deleting variable variables

I have been fighting against a function giving me a memory error and thanks to your support (Python: how to split and return a list from a function to avoid memory error) I managed to sort the issue; however, since I am not a pro-programmer I would like to ask for your opinion on my method and how to improve its performance (if possible).
The function is a generator function returning all cycles from an n-node digraph. However, for a 12-node digraph, there are about 115 million cycles (each defined as a list of nodes, e.g. [0,1,2,0] is a cycle). I need all the cycles available for further processing even after I have extracted some of their properties when they were first generated, so they need to be stored somewhere. So the idea is to cut the result array every 10 million cycles to avoid a memory error (when an array is too big, Python runs out of RAM) and create a new array to store the following results. For the 12-node digraph, I would then have 12 result arrays: 11 full ones (containing 10 million cycles each) and the last containing 5 million cycles.
However, splitting the result array is not enough since the variables stay in RAM. So, I still need to write each one to the disk and delete it afterwards to clear the RAM.
As stated in How do I create a variable number of variables?, using 'exec' to create variable variable names is not very "clean" and dictionary solutions are better. However, in my case, if I store the results in a single dictionary, it will run out of memory due to the size of the arrays. Hence, I went for the 'exec' way. I would be grateful if you could comment on that decision.
Also, to store the arrays I use numpy.savez_compressed, which gives me a 43 MB file for each 10-million-cycle array. Uncompressed, it creates a 500 MB file. However, using the compressed version slows down the writing process. Any idea how to speed up the writing and/or compressing process?
A simplified version of the code I wrote is as follows:
nbr_result_arrays = 0
result_array_0 = []
result_length = 10000000
tmp = result_array_0  # I use tmp to avoid using exec within the for loop (exec slows down code execution)
for cycle in generator:
    tmp.append(cycle)
    if len(tmp) == result_length:
        exec 'np.savez_compressed(\'results_' + str(nbr_result_arrays) + '\', tmp)'
        exec 'del result_array_' + str(nbr_result_arrays)
        nbr_result_arrays += 1
        exec 'result_array_' + str(nbr_result_arrays) + '=[]'
        exec 'tmp=result_array_' + str(nbr_result_arrays)
Thanks for reading,
Aleix
How about using itertools.islice?
import itertools
import numpy as np

for i in itertools.count():
    tmp = list(itertools.islice(generator, 10000000))
    if not tmp:
        break
    np.savez_compressed('results_{}'.format(i), tmp)
    del tmp
Thanks to all for your suggestions.
As suggested by @Aya, I believe that to improve performance (and avoid possible space issues) I should not store the results on the HD, because storing them takes about half as long as creating them, so loading and processing them again would come very close to creating the results again. Additionally, if I do not store any results, I save space, which can become a big issue for bigger digraphs (a 12-node complete digraph has about 115 million cycles, but a 29-node one has about 848E27 cycles... increasing at a factorial rate).
The idea is that I first need to go through all the cycles passing through the weakest arc to find the total probability of all cycles going through it. Then, with this total probability, I must go through all those cycles again to subtract them from the original array according to their weighted probability (I needed the total probability to be able to calculate the weighted probability: weighted_prob = prob_of_this_cycle / total_prob_through_this_edge).
Thus, I believe that this is the best approach to do that (but I am open to more discussions! :) ).
However, I have doubts regarding the processing speed of two sub-functions:
1st: finding whether a sequence contains a specific (smaller) sequence. I am doing that with the function "contains_sequence", which relies on the generator function "window" (as suggested in "Is there a Python builtin for determining if an iterable contained a certain sequence?"). However, I have been told that doing it with a deque would be up to 33% faster (see the deque-based sketch after the code below). Any other ideas?
2nd: I am currently finding the probability of a cycle by sliding through the cycle's nodes (the cycle is represented as a list), finding for each arc the probability of staying within the cycle, and then multiplying them all together to get the cycle probability (the function is named find_cycle_probability). Any performance suggestions on this function would be appreciated, since I need to run it for each cycle, i.e. countless times.
Any other tips/suggestion/comments will be most welcome! And thanks again for your help.
Aleix
Below follows the simplified code:
import itertools
import sys
import numpy
import networkx
# new_cycles is the author's own module providing simple_cycles_generator

def simple_cycles_generator_w_filters(working_array_digraph, arc):
    '''Generator function generating all cycles containing a specific arc.'''
    generator = new_cycles.simple_cycles_generator(working_array_digraph)
    for cycle in generator:
        if contains_sequence(cycle, arc):
            yield cycle
    return

def find_smallest_arc_with_cycle(working_array, working_array_digraph):
    '''Find the smallest arc through which at least one cycle flows.
    Returns:
    - if such an arc exists:
        smallest_arc_with_cycle = [a,b] where a is the start of the arc and b the end
        smallest_arc_with_cycle_value = x where x is the weight of the arc
    - if such an arc does not exist:
        smallest_arc_with_cycle = []
        smallest_arc_with_cycle_value = 0 '''
    smallest_arc_with_cycle = []
    smallest_arc_with_cycle_value = 0
    sparse_array = []
    for i in range(numpy.shape(working_array)[0]):
        for j in range(numpy.shape(working_array)[1]):
            if working_array[i][j] != 0:
                sparse_array.append([i, j, working_array[i][j]])
    sorted_array = sorted(sparse_array, key=lambda x: x[2])
    for i in range(len(sorted_array)):
        smallest_arc = [sorted_array[i][0], sorted_array[i][1]]
        generator = simple_cycles_generator_w_filters(working_array_digraph, smallest_arc)
        if any(generator):
            smallest_arc_with_cycle = smallest_arc
            smallest_arc_with_cycle_value = sorted_array[i][2]
            break
    return smallest_arc_with_cycle, smallest_arc_with_cycle_value

def window(seq, n=2):
    """Returns a sliding window (of width n) over data from the iterable
    s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ... """
    it = iter(seq)
    result = list(itertools.islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + [elem]
        yield result

def contains_sequence(all_values, seq):
    return any(seq == current_seq for current_seq in window(all_values, len(seq)))

def find_cycle_probability(cycle, working_array, total_outputs):
    '''Finds the cycle probability of a given cycle within a given array'''
    output_prob_of_each_arc = []
    for i in range(len(cycle) - 1):
        weight_of_the_arc = working_array[cycle[i]][cycle[i+1]]
        output_probability_of_the_arc = float(weight_of_the_arc) / float(total_outputs[cycle[i]])  # NOTE: total_outputs is an array, thus the float
        output_prob_of_each_arc.append(output_probability_of_the_arc)
    circuit_probabilities_of_the_cycle = numpy.prod(output_prob_of_each_arc)
    return circuit_probabilities_of_the_cycle

def clean_negligible_values(working_array):
    '''Cleans the array by rounding negligible values to 0 according to a
    pre-defined threshold.'''
    zero_threshold = 0.000001
    for i in range(numpy.shape(working_array)[0]):
        for j in range(numpy.shape(working_array)[1]):
            if working_array[i][j] == 0:
                continue
            elif 0 < working_array[i][j] < zero_threshold:
                working_array[i][j] = 0
            elif -zero_threshold <= working_array[i][j] < 0:
                working_array[i][j] = 0
            elif working_array[i][j] < -zero_threshold:
                sys.exit('Error')
    return working_array

original_array = 1000 * numpy.random.random_sample((5, 5))
total_outputs = numpy.sum(original_array, axis=0) + 100 * numpy.random.random_sample(5)
working_array = original_array.__copy__()
straight_array = working_array.__copy__()
cycle_array = numpy.zeros(numpy.shape(working_array))
iteration_counter = 0
working_array_digraph = networkx.DiGraph(working_array)

[smallest_arc_with_cycle, smallest_arc_with_cycle_value] = find_smallest_arc_with_cycle(working_array, working_array_digraph)

while smallest_arc_with_cycle:  # using implicit true value of a non-empty list
    cycle_flows_to_be_subtracted = numpy.zeros(numpy.shape((working_array)))
    # FIRST run of the generator to calculate each cycle probability
    # note: the cycle generator ONLY provides all cycles going through
    # the specified weakest arc
    generator = simple_cycles_generator_w_filters(working_array_digraph, smallest_arc_with_cycle)
    nexus_total_probs = 0
    for cycle in generator:
        cycle_prob = find_cycle_probability(cycle, working_array, total_outputs)
        nexus_total_probs += cycle_prob
    # SECOND run of the generator
    # using the nexus_prob_sum calculated before, I can allocate the weight of the
    # weakest arc to each cycle going through it
    generator = simple_cycles_generator_w_filters(working_array_digraph, smallest_arc_with_cycle)
    for cycle in generator:
        cycle_prob = find_cycle_probability(cycle, working_array, total_outputs)
        allocated_cycle_weight = cycle_prob / nexus_total_probs * smallest_arc_with_cycle_value
        # create the array to be subtracted
        for i in range(len(cycle) - 1):
            cycle_flows_to_be_subtracted[cycle[i]][cycle[i+1]] += allocated_cycle_weight
    working_array = working_array - cycle_flows_to_be_subtracted
    clean_negligible_values(working_array)
    cycle_array = cycle_array + cycle_flows_to_be_subtracted
    straight_array = straight_array - cycle_flows_to_be_subtracted
    clean_negligible_values(straight_array)
    # find the next weakest arc with cycles.
    working_array_digraph = networkx.DiGraph(working_array)
    [smallest_arc_with_cycle, smallest_arc_with_cycle_value] = find_smallest_arc_with_cycle(working_array, working_array_digraph)
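Regarding the first sub-question above (the deque suggestion for contains_sequence), a possible drop-in variant of window() built on collections.deque is sketched below. I have not benchmarked it, so the quoted 33% speedup is unverified:
from collections import deque
import itertools

def window_deque(seq, n=2):
    # Same sliding-window semantics as window() above, but the window is a
    # deque with maxlen=n, so advancing it is O(1) instead of rebuilding a list
    it = iter(seq)
    win = deque(itertools.islice(it, n), maxlen=n)
    if len(win) == n:
        yield list(win)
    for elem in it:
        win.append(elem)  # the oldest element falls off automatically
        yield list(win)

def contains_sequence_deque(all_values, seq):
    seq = list(seq)
    return any(seq == current for current in window_deque(all_values, len(seq)))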
