I need to calculate the mean of an array (length n), but only up to the i-th element (i <= n). For example, an array filled with dice rolls:
x = {1,4,5,3,6,...}
My current method is to use a loop with numpy.mean, slicing the array at each step:
x_mean_ith[0] = x[0]
for i in range(1, n):
    x_mean_ith[i] = np.mean(x[:i+1])
This method is too slow and I need it to be significantly faster. Currently this part of the code takes ~2 minutes when the array is on the order of n = 10^6.
Is there maybe a smarter way to calculate this without it taking too much time? Memory usage is not important.
You could do it using an efficient (vectorized) cumulative sum:
x_mean_ith = np.cumsum(x) / np.arange(1,len(x)+1)
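For reference, here is a minimal self-contained check of the vectorized version against the loop, using simulated dice rolls (a sketch; the array x is just illustrative):
import numpy as np
x = np.random.randint(1, 7, size=10**6)  # simulated dice rolls
# Running mean of the first i+1 elements, computed in one vectorized pass
x_mean_ith = np.cumsum(x) / np.arange(1, len(x) + 1)
# Spot-check a few positions against the loop version
for i in (0, 1, 999, 10**6 - 1):
    assert np.isclose(x_mean_ith[i], np.mean(x[:i + 1]))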
Let's say that I have a data stream where a single data point is retrieved at a time:
import numpy as np
def next_data_point():
    """
    Mock a data stream. Data points will always be a positive float.
    """
    return np.random.uniform(0, 1_000_000)
I need to be able to update a NumPy array and track the K smallest values seen so far from this stream (until the user decides to stop the analysis via some check_stop_condition() function). Let's say we want to capture the 1,000 smallest values from the stream; a naive way to accomplish this might be:
k = 1000
topk = np.full(k, fill_value=np.inf, dtype='float')
while check_stop_condition():
    topk[:] = np.sort(np.append(topk, next_data_point()))[:k]
This works fine but is quite inefficient and can be slow if repeated millions of times since we are:
creating a new array every time
sorting the concatenated array every time
So, I came up with a different approach to address these 2 inefficiencies:
k = 1000
topk = np.full(k, fill_value=np.inf)
while check_stop_condition():
    data_point = next_data_point()
    idx = np.searchsorted(topk, data_point)
    if idx < k:
        topk[idx + 1:] = topk[idx:-1]  # shift the larger values right, dropping the current maximum
        topk[idx] = data_point
Here, I leverage np.searchsorted() to replace np.sort() and to quickly find the insertion point, idx, for the next data point. I believe np.searchsorted uses a binary search and assumes the input array is already sorted. Then we shift the data in topk to make room and insert the new data point, if and only if idx < k.
I haven't seen this being done anywhere, so my question is whether there is anything that can be done to make this even more efficient, especially in the way I am shifting things around inside the if statement.
Sorting an array on every iteration is very expensive, so it is not surprising that the second method is faster. However, the speed of the second method is probably bounded by the slow array copy. The complexity of the first method is O(n k log(k)), while the second method has a complexity of O(n (log(k) + k p)), with n the number of points and p the probability of the branch being taken.
To build a faster implementation, you can use a tree, more specifically a self-balancing binary search tree. Here is the algorithm (in pseudocode):
topk = Tree()
maxi = np.inf
while check_stop_condition():        # O(n) iterations
    data_point = next_data_point()
    if len(topk) < 1000:             # O(1)
        topk.insert(data_point)      # O(log k)
    elif data_point < maxi:          # values >= maxi are discarded in O(1)
        topk.insert(data_point)      # O(log k)
        topk.deleteMaxNode()         # O(log k)
        maxi = topk.findMaxValue()   # O(log k)
The above algorithm runs in O(n log k). One can show that this complexity is optimal (using only data-point comparisons).
In practice, binary heaps can be a bit faster (with the same complexity). Indeed, they have several advantages over self-balancing binary search trees in this case:
they can be implemented in a very compact way in memory (reducing cache misses and memory consumption)
insertion of the first k = 1000 items can be done in O(k) time and very quickly
Note that discarded values are handled in constant time. This happens a lot on huge random datasets, as most values quickly become bigger than maxi. One can even prove that random datasets can be processed in O(n) time (optimal).
Note that Python provides a standard heap implementation in the heapq module, which is probably a good starting point.
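For illustration, here is a minimal heapq-based sketch of the algorithm above (the stream is mocked with a finite NumPy array, and the function name track_topk_smallest is just illustrative). heapq is a min-heap, so the k smallest values are tracked by storing their negations, which effectively turns it into a max-heap:
import heapq
import numpy as np

def track_topk_smallest(stream, k=1000):
    heap = []                                # negated values: the k smallest seen so far
    for value in stream:
        if len(heap) < k:
            heapq.heappush(heap, -value)     # O(log k)
        elif value < -heap[0]:               # -heap[0] is the largest of the k smallest
            heapq.heapreplace(heap, -value)  # pop the max, push the new value, O(log k)
        # else: discard the value in O(1)
    return np.sort(-np.asarray(heap))        # the k smallest values, ascending

values = np.random.uniform(0, 1_000_000, size=1_000_000)
topk = track_topk_smallest(values, k=1000)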
The quantity to be computed is log(k!), where k could be 4000 or even higher, but of course the log will compensate. I tried computing sum(log(j)) for j = 1..k, which is the same thing.
So, I am given a large array of integers and I want to efficiently compute log(k!) for each of them. This was my attempt:
integers = np.asarray([435, 535, 242,])
score = np.sum(np.log(np.arange(1,integers+1)))
This would work, except that np.arange would generate an array of different size for each integer, so when I run that, it gives me an error (as it should).
The problem could be easily solved with a for loop as follows:
scores = []
for i in range(integers.shape[0]):
    score = np.sum(np.log(np.arange(1, integers[i] + 1)))
    scores.append(score)
but that's too slow. My actual integers array has millions of values to compute.
Is there an efficient implementation of this that doesn't need a for loop? I was thinking of a lambda function or something like that, but I am not really sure how to apply it. Any help is appreciated!
How about math.lgamma? The gamma function generalizes the factorial (k! = Γ(k+1)), and lgamma is the log of the gamma function.
You don't need to compute the factorial and then take the log.
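For a single value, a quick sketch:
from math import lgamma
log_fact_435 = lgamma(435 + 1)  # log(435!) = log(Γ(436)), no huge factorial is ever formed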
There is also gammaln in SciPy.
Code, Python 3.9 x64 Win 10
import numpy as np
from scipy.special import gammaln
startf = 1 # start of factorial sequence
stopf = 400 # end of factorial sequence
q = gammaln(range(startf+1, stopf+1)) # n! = G(n+1)
print(q)
looks reasonable to me
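Applied to the question's integers array, this becomes a one-liner (a sketch; gammaln broadcasts over the whole array, so no loop is needed):
import numpy as np
from scipy.special import gammaln
integers = np.asarray([435, 535, 242])
scores = gammaln(integers + 1)  # log(k!) = log(Γ(k+1)) for each k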
You can vectorize with something like this:
mi = integers.max()
ls = np.log(np.arange(2, mi + 1))
Two optimizations so far: you only need the range up to the maximum, since the other numbers are covered by that, and you don't need log(1).
Now you take the cumulative sum:
cs = np.cumsum(ls)
The desired elements can be indexed directly:
result = cs[integers - 2]
If this is something you need to do many times, and you know the upper bound, this solution will be much faster than using math.lgamma or scipy.special.gammaln once you precompute cs up to the upper bound.
If this is a one-time call, here is the obligatory one-liner:
np.cumsum(np.log(np.arange(2, np.max(integers) + 1)))[integers - 2]
You can do most of the operations in-place if memory is a concern (I think it also makes them faster):
mi = integers.max()
cs = np.arange(2, mi + 1, dtype=float)  # float buffer so np.log can write its result in-place
np.cumsum(np.log(cs, out=cs), out=cs)
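As a quick sanity check, the precomputed-table approach agrees with scipy.special.gammaln on the sample integers from the question (a sketch):
import numpy as np
from scipy.special import gammaln
integers = np.asarray([435, 535, 242])
cs = np.cumsum(np.log(np.arange(2, integers.max() + 1)))
assert np.allclose(cs[integers - 2], gammaln(integers + 1))  # log(k!) for each k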
I am currently working with sparse matrices in Python. I chose to use lil_matrix for my problem because, as explained in the documentation, lil_matrix is intended for constructing sparse matrices. My sparse matrix has dimensions 2500x2500.
I have two pieces of code inside two loops (which iterate over the matrix elements) that have very different execution times, and I want to understand why. The first one is
current = lil_matrix_A[i,j]
lil_matrix_A[i, j] = current + 1
lil_matrix_A[j, i] = current + 1
Basically just taking every element of the matrix and incrementing its value by one.
And the second one is as below
value = lil_matrix_A[i, j]
temp = (value * 10000) / (dictionary[listA[i]] * dictionary[listB[j]])
lil_matrix_A[i, j] = temp
lil_matrix_A[j, i] = temp
Basically taking the value, applying a formula to it, and inserting the new value into the matrix.
The first piece of code executes in around 0.4 seconds and the second in around 32 seconds.
I understand that the second one has an extra calculation in the middle, but the time difference, in my opinion, does not make sense. Dictionary and list indexing have O(1) complexity, so they should not be a problem. Does anyone have a suggestion as to what is causing this difference in execution time?
Note: The number of elements in list and dictionary is also 2500.
I have 2 arrays of unequal size:
>>> np.size(array1)
4004001
>>> np.size(array2)
1000
Now, each element in array2 needs to be compared to all the elements in array1, to find the element in array1 whose value is nearest to that element of array2.
Upon finding this value, I need to store it in a separate array of size 1000, i.e. a size corresponding to array2.
The tedious and crude way of doing it would be a for loop: take each element from array2, compute the absolute difference with every element of array1, and then take the minimum. But this is going to make my code really slow.
I'd like to use NumPy vectorized operations to do this, but I've kind of hit a wall.
To make full use of NumPy's vectorization we need vectorized functions. Further, all values are looked up in the same array (array1) using the same criterion (nearest). Therefore, it is possible to make a special function for searching in array1 specifically.
However, to make the solution more reusable it is better to start with a general solution and then transform it into a more specific one. Thus, as a general approach to finding the closest value, we start with this find-nearest solution. Then we turn it into a specific function for array1 and vectorize it, to allow it to work on multiple elements at once:
import math
import numpy as np
from functools import partial
def find_nearest_sorted(array, value):
    idx = np.searchsorted(array, value, side="left")
    if idx > 0 and (idx == len(array) or math.fabs(value - array[idx-1]) < math.fabs(value - array[idx])):
        return array[idx-1]
    else:
        return array[idx]
array1 = np.random.rand(4004001)
array2 = np.random.rand(1000)
array1_sorted = np.sort(array1)
# Partially apply array1 to find function, to turn the general function
# into a specific, working with array1 only.
find_nearest_in_array1 = partial(find_nearest_sorted, array1_sorted)
# Vectorize specific function to allow us to apply it to all elements of
# array2, the numpy way.
vectorized_find = np.vectorize(find_nearest_in_array1)
output = vectorized_find(array2)
Hopefully this is what you wanted: a new vector mapping the data in array2 to the nearest values in array1.
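Note that np.vectorize is essentially a Python-level loop, so if it ever becomes the bottleneck, the same binary-search idea can be expressed with array operations only (a sketch, reusing array1_sorted and array2 from above):
idx = np.searchsorted(array1_sorted, array2)
idx = np.clip(idx, 1, len(array1_sorted) - 1)  # keep both neighbours in range
left = array1_sorted[idx - 1]
right = array1_sorted[idx]
output = np.where(array2 - left < right - array2, left, right)  # pick the closer neighbour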
The most "numpythonic" way is is to use broadcasting. This is a quick and easy way to calculate a distance matrix, for which you can then take the argmin of the absolute value.
array1 = np.random.rand(4004001)
array2 = np.random.rand(1000)
# Calculate distance matrix (on truncated array1 for memory reasons)
dmat = array1[:400400] - array2[:,None]
# Take the abs of the distance matrix and work out the argmin along the last axis
ix = np.abs(dmat).argmin(axis=1)
shape of dmat:
(1000, 400400)
shape of ix and contents:
(1000,)
array([237473, 166831, 72369, 11663, 22998, 85179, 231702, 322752, ...])
However, it's memory hungry if you do this operation in one go, and actually doesn't work on my 8GB machine for the size of arrays that you specify, which is why I reduced the size of array1.
To make it work within memory constraints, simply slice one of the arrays into chunks and apply broadcasting on each chunk in turn (or parallelise). In this case, I've sliced array2 into 10 chunks:
# Define number of chunks and calculate chunk size
n_chunks = 10
chunk_len = array2.size // n_chunks
# Preallocate output array (it holds argmin indices into array1)
out = np.zeros(1000, dtype=int)
for i in range(n_chunks):
    s = slice(i*chunk_len, (i+1)*chunk_len)
    out[s] = np.abs(array1 - array2[s, None]).argmin(axis=1)
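If, as in the question, you need the nearest values themselves rather than their positions, index back into array1 with those integer indices afterwards:
nearest_values = array1[out]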
import numpy as np
a = np.random.random(size=4004001).astype(np.float16)
b = np.random.random(size=1000).astype(np.float16)
# Use numpy broadcasting to compute pairwise differences, then find the argmin in a
# for each element in b. Finally, extract elements from a using the argmin array as indexes.
output = a[np.argmin(np.abs(b[:, None] - a), axis=1)]
This solution, while simple, can be very memory intensive. It may need a bit of further optimisation if used on large arrays.
I have some very large lists that I am working with (>1M rows), and I am trying to find a fast (the fastest?) way of, given a float, ranking that float against the list of floats and finding its percentage rank compared to the range of the list. Here is my attempt, but it's extremely slow:
X =[0.595068426145485,
0.613726840488019,
1.1532608695652,
1.92952380952385,
4.44137931034496,
3.46432160804035,
2.20331487122673,
2.54736842105265,
3.57702702702689,
1.93202764976956,
1.34720184204056,
0.824997304105564,
0.765782842381996,
0.615110856990126,
0.622708022872803,
1.03211045820975,
0.997225012974318,
0.496352327702226,
0.67103858866700,
0.452224068868272,
0.441842124852685,
0.447584524952608,
0.4645525042246]
val = 1.5
arr = np.array(X) #X is actually a pandas column, hence the conversion
arr = np.insert(arr,1,val, axis=None) #insert the val into arr, to then be ranked
st = np.sort(arr)
RANK = float([i for i,k in enumerate(st) if k == val][0])+1 #Find position
PCNT_RANK = (1-(1-round(RANK/len(st),6)))*100 #Find percentage of value compared to range
print(RANK, PCNT_RANK)
>>> 17.0 70.8333
For the percentage ranks I could probably build a distribution and sample from it, not quite sure yet, any suggestions welcome...it's going to be used heavily so any speed-up will be advantageous.
Thanks.
Sorting the array seems to be rather slow. If you don't need the array to be sorted in the end, then numpy's boolean operations are quicker.
arr = np.array(X)
bool_array = arr < val # Returns boolean array
RANK = float(np.sum(bool_array))
PCT_RANK = RANK/len(X)
Or, better yet, use a list comprehension and avoid numpy altogether.
RANK = float(sum([x<val for x in X]))
PCT_RANK = RANK/len(X)
Doing some timing, the numpy solution above gives 6.66 us on my system while the list comprehension method gives 3.74 us.
The two slow parts of your code are:
st = np.sort(arr). Sorting the list takes on average O(n log n) time, where n is the size of the list.
RANK = float([i for i, k in enumerate(st) if k == val][0]) + 1. Iterating through the list takes O(n) time.
If you don't need to sort the list, then as #ChrisMueller points out, you can just iterate through it once without sorting, which takes O(n) time and will be the fastest option.
If you do need to sort the list (or have access to it pre-sorted), then the fastest option for the second step is RANK = np.searchsorted(st, val) + 1. Since the list is already sorted, finding the index will only take O(log n) time by binary search instead of having to iterate through the whole list. This will still be a lot faster than your original code.
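Putting it together, a minimal sketch of the sorted-once approach, reusing X and val from the question (it reproduces the rank of 17 and percentage rank of ~70.83):
import numpy as np
st = np.sort(np.asarray(X))             # sort once, O(n log n)
val = 1.5
RANK = np.searchsorted(st, val) + 1     # O(log n) binary search
PCNT_RANK = RANK / (len(st) + 1) * 100  # +1 accounts for the inserted value, as in the original
print(RANK, PCNT_RANK)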