I'm using SageMath to perform some mathematical calculations, and at one point I have a for loop that looks like this:
uni = {}
end = (l[idx]^(e[idx] - 1)) * (l[idx] + 1)  # end in my case is about 2013265922,
                                            # but can also be much, much larger
for count in range(0, end):
    i = randint(1, 303325737249669131)  # this executes very fast in Sage
    if i in uni:
        uni[i] += 1
    else:
        uni[i] = 1
So basically, I want to generate a very large number of random integers in the given range, check whether each number is already in the dictionary, and if so increment its count, otherwise initialize it to 1. But the loop takes so long that it does not finish in a reasonable amount of time, not because the operations inside the loop are complicated, but because there is a huge number of iterations to perform. Therefore, I want to ask whether there is any way to avoid (or speed up) this kind of loop in Python?
I profiled your code (use cProfile for this), and the vast majority of the time is spent inside the randint function, which is called once per iteration of the loop.
I recommend you vectorize the loop using NumPy's random number generation routines, and then use a single call to the Counter class to extract the frequency counts.
import numpy.random
import numpy
from collections import Counter

assert 303325737249669131 < 18446744073709551615  # upper limit for uint64

# 'end' is the number of iterations from the question
numbers = numpy.random.randint(low=0, high=303325737249669131, size=end,
                               dtype=numpy.uint64)
frequency = Counter(numbers)
For a loop of 1,000,000 iterations (smaller than the one you suggest) I observed a reduction from 6 seconds to about 1 second, so even with this approach you cannot expect much more than an order of magnitude reduction in computation time.
You may think that keeping an array of all the values in memory is inefficient and may lead to memory exhaustion before the computation ends. However, because "end" is small compared with the range of the random integers, the rate at which you will record collisions is low, so the memory cost of a full array is not significantly larger than storing the dictionary. If this does become an issue, you may wish to perform the computation in batches. In that spirit you may also want to use the multiprocessing facilities to distribute computations across many CPUs or even many machines (but look out for network costs if you choose that route).
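Here is a minimal sketch of the batched variant (the batch size of 10 million is an arbitrary assumption; end and the upper bound are taken from the question):
import numpy
from collections import Counter

frequency = Counter()
batch = 10_000_000  # assumed batch size; tune to available memory
remaining = end
while remaining > 0:
    size = min(batch, remaining)
    numbers = numpy.random.randint(low=0, high=303325737249669131,
                                   size=size, dtype=numpy.uint64)
    frequency.update(numbers)  # Counter counts each drawn value
    remaining -= size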
The biggest speedup you can get without low-level magic is to use defaultdict, i.e.
from collections import defaultdict

uni = defaultdict(int)
for count in range(0, end):
    i = randint(1, 303325737249669131)  # this executes very fast in Sage
    uni[i] += 1
If you're using python2, change range to xrange.
Beyond this, I'm pretty sure it's somewhere near the limit for Python. Each iteration of the loop is
generating a random integer (optimized as much as possible without static typing),
calculating its hash,
updating the dict (with defaultdict the if-else branch is replaced by more optimized code),
and, from time to time, making malloc calls to resize the dict - which is fast (considering the inability to preallocate memory for a dict).
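If you want to measure the difference yourself, a rough micro-benchmark along these lines works (the loop size of 100,000 is an arbitrary assumption; absolute numbers will vary by machine and Python version):
import timeit

setup = "from random import randint\nfrom collections import defaultdict"
plain_dict = """
uni = {}
for _ in range(100_000):
    i = randint(1, 303325737249669131)
    if i in uni:
        uni[i] += 1
    else:
        uni[i] = 1
"""
default_dict = """
uni = defaultdict(int)
for _ in range(100_000):
    i = randint(1, 303325737249669131)
    uni[i] += 1
"""
print("plain dict :", min(timeit.repeat(plain_dict, setup=setup, number=10)))
print("defaultdict:", min(timeit.repeat(default_dict, setup=setup, number=10)))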
Related
I am having an issue with my attempt to speed up the computation of my program. In the serial (non-parallel) Python version of my code, I compute the values of a function f(x), which returns a float, for sliding windows of a NumPy array, as can be seen below:
a = np.array([i for i in range(1, 10000000)])  # Some data here
N = 100
result = []
for i in range(N, len(a)):
    result.append(f(a[i - N:i]))
Since the NumPy array is really large and f(x)'s runtime is high, I tried to apply multiprocessing to speed up my code. Through my research I found that charm4py might be a great solution: it has a Pool feature, which breaks up an array into chunks and distributes the work between spawned processes. I implemented charm4py's multiprocessing example and then translated it to my case:
# Split an array into subarrays for sequential processing (takes only 5 seconds)
a = np.array([a[i - N:i] for i in range(N, len(a))])
result = charm.pool.map(f, a, chunksize=512, ncores=-1)
# I'm running this code through "charmrun +p18 example.py"
The issue I encountered is that the code ran a lot slower, despite being executed on a more powerful instance (18 physical cores vs. 6 physical cores).
I expected to see a ~3x improvement, but it didn't happen. While searching for solutions I learned that there is some overhead for expensive deserialization/spinning up new processes, but I am not sure whether this is the case here.
I would really appreciate any feedback or suggestions on how one can implement fast parallel processing of a NumPy array, assuming that the function f(x) is not vectorized, takes a pretty long time to compute, and internally makes a large number of specific/individual calls that cannot be parallelized.
Thank you!
It sounds like you're trying to parallelize this operation with either Charm or Ray (it's not clear how you would use both together).
If you choose to use Ray, and your data is a numpy array, you can take advantage of zero-copy reads to avoid any deserialization overhead.
You may want to optimize your sliding window function a bit, but it will likely look like this:
import numpy as np
import ray

@ray.remote
def apply_rolling(f, arr, start, end, window_size):
    results_arr = []
    for i in range(start, end - window_size):
        results_arr.append(f(arr[i : i + window_size]))
    return np.array(results_arr)
Note that this structure lets us call f multiple times within a single task (a.k.a. batching).
To use our function:
# Some small setup
big_arr = np.arange(10000000)
big_arr_ref = ray.put(big_arr)
batch_size = len(big_arr) // int(ray.available_resources()["CPU"])
window_size = 100

# Kick off our tasks
result_refs = []
for i in range(0, len(big_arr), batch_size):
    end_point = min(i + batch_size, len(big_arr))
    ref = apply_rolling.remote(f, big_arr_ref, i, end_point, window_size)
    result_refs.append(ref)

# Handle the results
flattened = []
for section in ray.get(result_refs):
    flattened.extend(section)
I'm sure you'll want to customize this code, but here are 2 important and nice properties that you'll likely want to maintain.
Batching: We're utilizing batching to avoid starting too many tasks. In any system, parallelizing incurs overhead, so we always want to be careful and make sure we don't start too many tasks. Furthermore, we are calculating batch_size = len(big_arr) // int(ray.available_resources()["CPU"]) to make sure we use exactly the same number of batches as we have CPUs.
Shared memory: Since Ray's object store supports zero copy reads from numpy arrays, calling ray.get or reading from a numpy array is pretty much free (on a single machine where there are no network costs). There is some overhead in serializing/calling ray.put though, so this approach only calls put (the expensive operation) once, and ray.get (which is implicitly called) many times.
Tip: Be careful when passing arrays as parameters directly into remote functions. It will call ray.put multiple times, even if you pass the same object.
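To illustrate that tip, here is a sketch (process_batch is a hypothetical task; big_arr and batch_size are from the snippet above):
@ray.remote
def process_batch(arr, start, end):
    return arr[start:end].sum()  # placeholder work on a read-only slice

arr_ref = ray.put(big_arr)  # serialized into the object store once
refs = [process_batch.remote(arr_ref, i, i + batch_size)
        for i in range(0, len(big_arr), batch_size)]
# By contrast, calling process_batch.remote(big_arr, i, i + batch_size) in the
# loop would implicitly ray.put() the whole array again on every call.
results = ray.get(refs)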
Here's an example based on your code snippet that uses Ray to parallelize the array computations.
Note that the best way to do this will depend on what your function f looks like.
import numpy as np
import ray
import time

ray.init()

N = 100000
a = np.arange(10**7)
a_id = ray.put(a)

@ray.remote
def f(array, index):
    # Do processing
    time.sleep(0.2)
    return 1

result_ids = []
for i in range(len(a) // N):
    result_ids.append(f.remote(a_id, i))

results = ray.get(result_ids)
When I compute function outputs for a mass of independent, varying inputs, I use Pool for parallel computing and write code like
import multiprocessing
from multiprocessing import Pool

num_cores = int(multiprocessing.cpu_count())
with Pool(processes=num_cores) as pool:
    results = pool.map(myfunc, xlist)
where xlist = [x1,x2,...,xn] is the input list.
This code works pretty well with a short xlist, as expected, but becomes extremely slow when the input list gets large. What I actually mean is that the performance degrades sharply.
For example, if this code takes only T seconds for a list of length 10, it takes about 10T for a length-100 list and 100T for a length-1000 list; however, it may take 50000T or even much more for a length-10000 list (my computer has a large enough memory storage).
Currently I split the long list into multiple short sub-lists, piece by piece, and successfully avoid the degradation (sketched below). But I wonder why this bizarre situation happens. What should I do to feed the long list directly into the pool without any performance degradation?
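For reference, a minimal sketch of that chunking workaround (the chunk length of 50 is an arbitrary assumption; myfunc and xlist are as above):
import multiprocessing
from multiprocessing import Pool

def map_in_chunks(func, items, chunk_len=50):
    num_cores = int(multiprocessing.cpu_count())
    results = []
    with Pool(processes=num_cores) as pool:
        for start in range(0, len(items), chunk_len):
            # feed the pool one short sub-list at a time
            results.extend(pool.map(func, items[start:start + chunk_len]))
    return results

results = map_in_chunks(myfunc, xlist)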
Let's say I have a huge list containing random numbers, for example:
L = [random.randrange(0,25000000000) for _ in range(1000000000)]
I need to get rid of the duplicates in this list
I wrote this code for lists containing a smaller number of elements
def remove_duplicates(list_to_deduplicate):
    seen = set()
    result = []
    for i in list_to_deduplicate:
        if i not in seen:
            result.append(i)
            seen.add(i)
    return result
In the code above I create a set so I can memorize which numbers have already appeared in the list I'm working on. If a number is not in the set, I add it to the result list that I need to return and save it in the set so it won't be added to the result list again.
Now for 1,000,000 numbers in a list all is good and I get a result fast, but for anything above, say, 1,000,000,000, problems arise: I need to use the different cores on my machine to break up the problem and then combine the results from multiple processes.
My first guess was to make a set accessible to all processes, but many complications will arise.
How can one process read while another one may be adding to the set? I don't even know whether it is possible to share a set between processes. I know we can use a Queue or a pipe, but I'm not sure how to use them.
Can someone give me advice on the best way to solve this problem?
I am open to any new idea.
I'm skeptical that even your largest list is big enough for multiprocessing to improve timings. Using numpy and multithreading is probably your best chance.
Multiprocessing introduces quite some overhead and increases memory consumption, as @Frank Merrow rightly mentioned earlier.
That's not the case (to that extent) for multithreading, though. It's important not to mix these terms up, because processes and threads are not the same.
Threads within the same process share their memory, distinct processes do not.
The problem with going multi-core in Python is the GIL, which doesn't allow multiple threads (in the same process) to execute Python bytecode in parallel. Some C extensions like numpy can release the GIL; this enables profiting from multi-core parallelism with multithreading. Here's your chance to get some speedup on top of a big improvement just by using numpy.
from multiprocessing.dummy import Pool  # .dummy uses threads
import numpy as np

r = np.random.RandomState(42).randint(0, 25000000000, 100_000_000)
n_threads = 8

result = np.unique(np.concatenate(
    Pool(n_threads).map(np.unique, np.array_split(r, n_threads)))
).tolist()
Use numpy and a thread-pool, split up the array, make the sub-arrays unique in separate threads, then concatenate the sub-arrays and make the recombined array unique once more.
The final dropping of duplicates for the recombined array is necessary because within the sub-arrays only local duplicates can be identified.
For low entropy data (many duplicates) using pandas.unique instead of numpy.unique can be much faster. Unlike numpy.unique it also preserves order of appearance.
Note that using a thread-pool like above only makes sense if the numpy function is not already multi-threaded under the hood by calling into low-level math libraries. So, always test to see if it actually improves performance and don't take it for granted.
Tested with 100M randomly generated integers in the following ranges:
High entropy: 0 - 25_000_000_000 (199560 duplicates)
Low entropy: 0 - 1000
Code
import time
import timeit
from multiprocessing.dummy import Pool  # .dummy uses threads

import numpy as np
import pandas as pd


def time_stmt(stmt, title=None):
    t = timeit.repeat(
        stmt=stmt,
        timer=time.perf_counter_ns, repeat=3, number=1, globals=globals()
    )
    print(f"\t{title or stmt}")
    print(f"\t\t{min(t) / 1e9:.2f} s")


if __name__ == '__main__':

    n_threads = 8  # machine with 8 cores (4 physical cores)

    stmt_np_unique_pool = \
"""
np.unique(np.concatenate(
    Pool(n_threads).map(np.unique, np.array_split(r, n_threads)))
).tolist()
"""

    stmt_pd_unique_pool = \
"""
pd.unique(np.concatenate(
    Pool(n_threads).map(pd.unique, np.array_split(r, n_threads)))
).tolist()
"""
    # -------------------------------------------------------------------------

    print(f"\nhigh entropy (few duplicates) {'-' * 30}\n")

    r = np.random.RandomState(42).randint(0, 25000000000, 100_000_000)

    r = list(r)
    time_stmt("list(set(r))")

    r = np.asarray(r)
    # numpy.unique
    time_stmt("np.unique(r).tolist()")
    # pandas.unique
    time_stmt("pd.unique(r).tolist()")
    # numpy.unique & Pool
    time_stmt(stmt_np_unique_pool, "numpy.unique() & Pool")
    # pandas.unique & Pool
    time_stmt(stmt_pd_unique_pool, "pandas.unique() & Pool")

    # ---

    print(f"\nlow entropy (many duplicates) {'-' * 30}\n")

    r = np.random.RandomState(42).randint(0, 1000, 100_000_000)

    r = list(r)
    time_stmt("list(set(r))")

    r = np.asarray(r)
    # numpy.unique
    time_stmt("np.unique(r).tolist()")
    # pandas.unique
    time_stmt("pd.unique(r).tolist()")
    # numpy.unique & Pool
    time_stmt(stmt_np_unique_pool, "numpy.unique() & Pool")
    # pandas.unique() & Pool
    time_stmt(stmt_pd_unique_pool, "pandas.unique() & Pool")
As you can see in the timings below, just using numpy without multithreading already accounts for the biggest performance improvement. Also note that pandas.unique() is faster than numpy.unique() (only) for many duplicates.
high entropy (few duplicates) ------------------------------

    list(set(r))
        32.76 s
    np.unique(r).tolist()
        12.32 s
    pd.unique(r).tolist()
        23.01 s
    numpy.unique() & Pool
        9.75 s
    pandas.unique() & Pool
        28.91 s

low entropy (many duplicates) ------------------------------

    list(set(r))
        5.66 s
    np.unique(r).tolist()
        4.59 s
    pd.unique(r).tolist()
        0.75 s
    numpy.unique() & Pool
        1.17 s
    pandas.unique() & Pool
        0.19 s
Can't say I like this, but it should work, after a fashion.
Divide the data into N read-only pieces and distribute one per worker to search. Everything is read-only, so it can all be shared. Each worker i in 1...N checks its list against all the other 'future' lists i+1...N.
Each worker i maintains a bit table for its i+1...N lists, noting whether any of its items hit any of the future items.
When everyone is done, worker i sends its bit table back to the master, where it can be ANDed; the zeroes then get deleted. No sorting, no sets. The checking is not fast, though.
If you don't want to bother with multiple bit tables, you can let every worker i write zeroes when it finds a dup above its own region of responsibility. HOWEVER, now you run into real shared-memory issues. For that matter, you could even let each worker just delete dups above its region, but ditto.
Even dividing up the work begs the question: it's expensive for each worker to walk through everyone else's list for each of its own entries, roughly (N-1)*len(region)/2 comparisons. Each worker could instead create a set of its region, or sort its region; either would permit faster checks, but the costs add up. A simplified sketch of the scheme follows.
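Here is a minimal, simplified sketch of that idea (the function names are hypothetical; each region is deduplicated locally first, so the bit tables only have to catch cross-region duplicates, and this simple version pickles the regions to every worker rather than truly sharing them):
from multiprocessing import Pool
import numpy as np

def mark_future_dups(args):
    # Worker i: flag items in regions i+1..N that already occur in region i.
    i, regions = args
    own = set(regions[i])
    # One boolean "bit table" per future region: True means "duplicate of region i".
    return i, [[x in own for x in future] for future in regions[i + 1:]]

def parallel_dedup(data, n_workers=4):
    regions = [list(dict.fromkeys(chunk))  # local dedup, keeps first occurrence
               for chunk in np.array_split(np.asarray(data), n_workers)]
    keep = [[True] * len(region) for region in regions]
    with Pool(n_workers) as pool:
        for i, tables in pool.map(mark_future_dups,
                                  [(j, regions) for j in range(n_workers)]):
            for offset, table in enumerate(tables, start=i + 1):
                for k, is_dup in enumerate(table):
                    if is_dup:
                        keep[offset][k] = False  # master combines the bit tables
    return [x for region, mask in zip(regions, keep)
              for x, ok in zip(region, mask) if ok]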
My question seems elementary at first, but bear with me.
I wrote the following code in order to test how long it would take python to count from 1 to 1,000,000.
import time

class StopWatch:
    def __init__(self, startTime = time.time()):
        self.__startTime = startTime
        self.__endTime = 0

    def getStartTime(self):
        return self.__startTime

    def getEndTime(self):
        return self.__endTime

    def stop(self):
        self.__endTime = time.time()

    def start(self):
        self.__startTime = time.time()

    def getElapsedTime(self):
        return self.__endTime - self.__startTime

count = 0
Timer = StopWatch()
for i in range(1, 1000001):
    count += i
Timer.stop()
total_Time = Timer.getElapsedTime()
print("Total time elapsed to count to 1,000,000: ", total_Time, " milliseconds")
I calculated a surprisingly short time span. It was 0.20280098915100098 milliseconds. I first want to ask: Is this correct?
I expected execution to be at least 2 or 3 milliseconds, but I did not anticipate it would be able to make that computation in less than a half of a millisecond!
If this is correct, that leads me to my secondary question: WHY is it so fast?
I know CPUs are essentially built for arithmetic, but I still wouldn't anticipate it being able to count to one million in two tenths of a millisecond!
Maybe you were tricked by the time measurement unit, as @jonrsharpe commented.
Nevertheless, a 3rd generation Intel i7 is capable of 120+ GIPS (i.e. billions of elementary operations per second), so assuming all cache hits and no context switches (put simply, no unexpected waits), it could easily count from 0 to 1G in said time and even more. Probably not with Python, since it has some overhead, but it is still possible.
Explaining how a modern CPU can achieve such an... "insane" speed is quite a broad subject, actually the collaboration of more than one technology:
a dynamic scheduler will rearrange elementary instructions to reduce conflicts (thus, waits) as much as possible
a well-engineered cache will promptly provide code and (although less problematic for this benchmark) data.
a dynamic branch predictor will profile code and speculate on branch conditions (e.g. "for loop is over or not?") to anticipate jumps with a high chance of "winning".
a good compiler will put in some additional effort by rearranging instructions to reduce conflicts, or by making loops faster (unrolling, merging, etc.)
multi-precision arithmetic could exploit vector operations provided by the MMX instruction set and similar.
In short, there is more than a reason why those small wonders are so expensive :)
First, as has been pointed out, time() output is actually in seconds, not milliseconds.
Also, you are not just counting to 1M: you are performing 1M additions, accumulating a total of about 1M²/2 (the sum 1 + 2 + ... + n equals n(n+1)/2), and you are initializing a million-long list with range (unless you are on Python 3).
I ran a simpler test on my laptop:
start = time.time()
i = 0
while i < 1000000:
    i += 1
print time.time() - start
Result:
0.069179093451
So, 70 milliseconds. That translates to 14 million operations per second.
Let's look at the table that Stefano probably referred to (http://en.wikipedia.org/wiki/Instructions_per_second) and do a rough estimation.
They don't have an i5 like I do, but the slowest i7 will be close enough. It clocks 80 GIPS with 4 cores, 20 GIPS per core.
(By the way, if your question is "how does it manage to get 20 GIPS per core?", can't help you. It's maaaagic nanotechnology)
So the core is capable of 20 billion operations per second, and we get only 14 million - different by a factor of 1400.
At this point the right question is not "why so fast?" but "why so slow?". Probably Python overhead. What if we try this in C?
#include <stdio.h>
#include <unistd.h>
#include <time.h>

int i = 0;
int million = 1000000;

int main() {
    clock_t cstart = clock();
    while (i < million) {
        i += 1;
    }
    clock_t cend = clock();
    printf("%.3f cpu sec\n", ((double)cend - (double)cstart) / CLOCKS_PER_SEC);
    return 0;
}
Result:
0.003 cpu sec
This is 23 times faster than Python, and only 60 times away from the theoretical number of 'elementary operations' per second. I see two operations here - comparison and addition - so make that 30 times. This is entirely reasonable, as elementary operations are probably much smaller than our addition and comparison (let the assembler experts tell us), and we also didn't factor in context switches, cache misses, time-calculation overhead, and who knows what else.
This also suggests that Python performs 23 times as many operations to do the same thing. This is also entirely reasonable, because Python is a high-level language. This is the kind of penalty you get in high-level languages - and now you understand why speed-critical sections are usually written in C.
Also, Python's integers are immutable, so memory should be allocated for each new integer (the Python runtime is smart about it, but nevertheless).
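For instance, CPython caches small integers (roughly -5 through 256), which is part of that smartness; a quick check:
a, b = 256, 256
print(a is b)    # True: CPython reuses the cached object for small ints
a, b = int('1000'), int('1000')
print(a is b)    # False: each larger int created at runtime is a fresh object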
I hope that answers your question and teaches you a little bit about how to perform incredibly rough estimations =)
Short answer: As jonrsharpe mentioned in the comments, it's seconds, not milliseconds.
Also, as Stefano said: check his posted answer. It has a lot of detail beyond just the ALU.
I'm just writing to mention: when you define default values in your classes or functions, make sure to use a simple immutable value instead of a function call or something like that. Your class is actually setting the start time of the timer once for all instances - you will get a nasty surprise if you create a new Timer, because it will reuse that previous value as the initial value. Try this, and the timer does not get reset for the second Timer:
#...
count = 0
Timer = StopWatch()
time.sleep(1)
Timer = StopWatch()
for i in range(1, 1000001):
    count += i
Timer.stop()
total_Time = Timer.getElapsedTime()
print("Total time elapsed to count to 1,000,000: ", total_Time, " milliseconds")
You will get about 1 second instead of what you expect.
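A common fix (sketched here; not the only option) is to use an immutable sentinel like None as the default and call time.time() inside __init__, so each instance gets its own fresh start time:
import time

class StopWatch:
    def __init__(self, startTime=None):
        # time.time() is now evaluated per instance, not once at class definition
        self.__startTime = time.time() if startTime is None else startTime
        self.__endTime = 0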
I wrote a recursive function that exhaustively generates matrices of certain characteristics.
The function is as such:
def heavies(rowSums, colSums, colIndex, matH):
    if colIndex == len(colSums) - 1:
        for stuff in heavy_col_permutations(rowSums, colSums, colIndex):
            matH[:, colIndex] = stuff[0]
            yield matH.copy()
        return

    for stuff in heavy_col_permutations(rowSums, colSums, colIndex):
        matH[:, colIndex] = stuff[0]
        rowSums = stuff[1]
        for matrix in heavies(rowSums, colSums, colIndex + 1, matH):
            yield matrix
and heavy_col_permutations is a function that just returns a column of a matrix with characteristics I need as well.
The problem is that as heavies is yielding a lot of matrices, it takes up too much memory.
I end up calling this from another function one by one, and eventually I take up too much RAM and my process is killed (I'm running this on a server with memory caps). How can I write this to make it use less memory?
The program looks something like:
r = int(argv[1])
n = int(argv[2])
m = numpy.zeros((r, r), dtype=numpy.int32)
for row, col in heavy_listing(r, n):
    for matrix in heavies(row, col, 0, m):
        # do more stuff with matrix
I know that the heavies function is where most of the memory consumption happens; I just need to reduce it.
Things you can try:
Ensure that the matrix copies created by heavies() are not kept referenced in memory.
Look at the gc module: call collect() and play around with set_threshold() (see the sketch after this list)
Rewrite the function to be iterative instead of recursive
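A rough sketch of the first two suggestions (handle() is a hypothetical function standing in for "do more stuff with matrix"; heavies, heavy_listing, r, n, and m are from the question):
import gc

processed = 0
for row, col in heavy_listing(r, n):
    for matrix in heavies(row, col, 0, m):
        handle(matrix)  # do the work here; keep only summaries, not the matrix itself
        processed += 1
        if processed % 10000 == 0:
            gc.collect()  # periodically reclaim objects that are no longer reachable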