I have an array A and a list of slicing indices (s, t); let's call this list L.
I want to find the 85th percentile of each of A[s1:t1], A[s2:t2], ...
Is there a way to vectorize these operations in numpy?
ans = []
for (s, t) in L:
    ans.append(numpy.percentile(A[s:t], 85))
looks cumbersome.
Thanks a lot!
PS: it's safe to assume s1 < s2 < ... and t1 < t2 < ... This is really just a sliding-window percentile problem.
Given that you're dealing with non-uniform intervals (i.e. the slices aren't all the same size), no, there's no way to have numpy do it in a single function call.
If the slice size were uniform, then you could do it with various tricks, as @eat commented (a sketch of one such trick follows the list comprehension below).
However, what's wrong with a list comprehension? It's exactly equivalent to your loop above, but it looks "cleaner" if that's what you're worried about.
ans = [numpy.percentile(A[s:t], 85) for s,t in L]
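For reference, here is a hedged sketch of the kind of trick that applies only when the windows all have the same width (the names and sizes are made up, and sliding_window_view needs NumPy 1.20 or newer):
import numpy as np

A = np.random.rand(100)
starts = np.array([0, 5, 10, 15])   # assumed: uniform windows starting here
w = 20                              # assumed: common window width
windows = np.lib.stride_tricks.sliding_window_view(A, w)[starts]
ans = np.percentile(windows, 85, axis=1)   # one call for all windows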
Related
I'm trying to code something like this:
where x and y are two different numpy arrays and j is an index into them. I don't know the length of the arrays because they will be entered by the user, and I cannot use loops to code this.
My main problem is finding a way to move between indices, since I would need to go from
x[2]-x[1] ... x[3]-x[2]
and so on.
I'm stumped but I would appreciate any clues.
A numpy-ic solution would be:
np.square(np.diff(x)).sum() + np.square(np.diff(y)).sum()
A list comprehension approach would be:
sum([(x[k]-x[k-1])**2+(y[k]-y[k-1])**2 for k in range(1,len(x))])
which will give you the result you want, even if your data comes in as plain lists.
x[2]-x[1] ... x[3]-x[2] can be generalized to:
x[[1,2,3,...]] - x[[0,1,2,...]]
x[1:]-x[:-1] # ie. (1 to the end)-(0 to almost the end)
numpy can take the difference between two arrays of the same shape
In list terms this would be
[i-j for i,j in zip(x[1:], x[:-1])]
np.diff does essentially this, a[slice1]-a[slice2], where the slices are as above.
The full answer squares, sums, and takes square roots.
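A quick sanity check of the slicing explanation above, with toy numbers:
import numpy as np

x = np.array([3.0, 5.0, 9.0, 10.0])
print(x[1:] - x[:-1])               # [2. 4. 1.]
print(np.diff(x))                   # the same differences
print(np.square(np.diff(x)).sum())  # 4 + 16 + 1 = 21.0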
This is more of a Matlab programming question than it is a math question.
I'd like to run gradient descent multiple times with different learning rates. I have a set of learning rates
alpha = [0.3, 0.1, 0.03, 0.01, 0.003, 0.001];
and each time I run gradient descent, I get a vector J_vals as output. However, I don't know Matlab well enough to know how to implement this besides doing something like:
[theta, J_vals] = gradientDescent(...., alpha(1),...);
J1 = J_vals;
[theta, J_vals] = gradientDescent(...., alpha(2),...);
J2 = J_vals;
and so on.
I thought about using a for loop, but then I don't know how I would deal with the J_vals outputs (not sure how to handle J1, J2, and so on inside the loop). Perhaps it would look something like this:
for i = 1:length(alpha)
    [theta, J_vals] = gradientDescent(..., alpha(i), ...);
    J(i) = J_vals;
end
Then I would have a vector of vectors.
In Python, I would just run a for loop and append each new result to the end of a list. How do I implement something like this in Matlab? Or is there a more efficient way?
If you know how many loops you are going to have and the size of J_vals (or at least a reasonable upper bound), I would suggest pre-allocating the container array:
J = zeros(n, 1);   % n = total number of values you expect to store
then on each loop insert the new values into the next free block:
J(start : start+numel(J_vals)-1) = J_vals;
(advancing start by numel(J_vals) after each insert). That way you don't reallocate memory. If you don't know the sizes in advance, you can append the values to the array. For example,
J = []; % initialize
for i = 1:length(alpha)
    [theta, J_vals] = gradientDescent(..., alpha(i), ...);
    J = [J; J_vals]; % append J_vals below the values collected so far
end
but this reallocates the array on every iteration. If it's not too many loops, it should be OK.
Matlab's "cell arrays" are kind of like lists in Python. They are similar in that you can put variable datatypes into them. Nobody seems to be too sure, but most likely the cell array is implemented as an array of object pointers. That means that it is still somewhat expensive to append to it (cell_array{length(cell_array) + 1} = new_data), but at least you are only appending a pointer instead of the entire column. You would still have to convert the cell array to a normal matrix afterward using cell2mat.
The most idiomatic Matlab solution is to pre-allocate (as @dpmcmlxxvi suggested).
I think what you are describing is a really common use case, and it's unfortunate that Matlab requires such a verbose idiom for this. Also it's frustrating that the documentation is opaque on how cell arrays are implemented and whether it is expensive to append to a cell array.
Your solution works just fine as long as you add a : for the row subscript (assuming J_vals is a column vector):
for i = 1:length(alpha)
    [theta, J_vals] = gradientDescent(..., alpha(i), ...);
    J(:, i) = J_vals;
    %// ^... all rows, column 'i'
end
You could even put that as the return value:
for i = 1:length(alpha)
    [theta, J(:, i)] = gradientDescent(..., alpha(i), ...);
    %//      ^... add returned value directly to our list
end
Both of these methods allow you to preallocate your matrix for a potential speed gain.
If you want to build your list as you go, you can use the method in @dpmcmlxxvi's answer, or you can use the special subscript end. Neither of these methods is compatible with preallocation, though.
for i = 1:length(alpha)
    [theta, J(:, end+1)] = gradientDescent(..., alpha(i), ...);
    %//      ^... add the new vector after the current end of the list
end
I would also like to suggest you not use i as a variable name in Matlab. I know it's natural in other languages, but in Matlab assigning to it shadows the built-in imaginary unit i.
See: https://stackoverflow.com/a/14790765/1377097
I am having a small issue understanding indexing in Numpy arrays. I think a simplified example is best to get an idea of what I am trying to do.
So first I create an array of zeros of the size I want to fill:
x = range(0,10,2)
y = range(0,10,2)
a = zeros(len(x),len(y))
so that will give me a 5x5 array of zeros. Now, I want to fill the array with a rather complicated function that I can't get to work with grids. My problem is that I'd like to iterate as:
for i in xrange(0,10,2):
    for j in xrange(0,10,2):
        # do the function and fill the array entry corresponding to (i, j)
however, right now I would like a[2,10] to hold the function of 2 and 10, but instead the entry for the function of 2 and 10 ends up at a[1,4] or whatever.
Again, maybe this is elementary, I've gone over the docs and find myself at a loss.
EDIT:
In the end I vectorized as much as possible and wrote the simulation loops that I could not vectorize in Cython. Further, I used joblib to parallelize the operation. I stored the results in a list because an array was not filling correctly when running in parallel. I then used itertools to split the list into individual results and pandas to organize them.
Thank you for all the help
Some tips to get this done while keeping good performance:
- avoid Python `for` loops
- create a function that can deal with vectorized inputs
Example:
def f(xs, ys):
    return xs**2 + ys**2 + xs*ys
where you can pass xs and ys as arrays and the operation will be done element-wise:
xs = np.random.random((100,200))
ys = np.random.random((100,200))
f(xs,ys)
You should read more about numpy broadcasting to get a better understanding of how array operations work. This will help you design a function that handles arrays properly.
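For instance, here is a minimal broadcasting sketch (the shapes are made up just for illustration): a column vector broadcast against a row vector yields the full 2-D grid of results with no explicit loops.
import numpy as np

col = np.arange(3)[:, None]       # shape (3, 1)
row = np.arange(4)[None, :]       # shape (1, 4)
grid = col**2 + row**2 + col*row  # broadcasts to shape (3, 4)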
First, you are missing a pair of parentheses in zeros; the first argument should be a tuple:
a = zeros((len(x),len(y)))
Then, the corresponding indices into your table are i/2 and j/2:
for i in xrange(0,10,2):
    for j in xrange(0,10,2):
        # do function and fill the array corresponding to (i,j)
        a[i/2, j/2] = 1
But I second Saullo Castro, you should try to vectorize your computations.
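A hedged sketch of what the fully vectorized version could look like, assuming the actual function is element-wise like the f(xs, ys) example above (the quadratic expression is only a placeholder):
import numpy as np

i = np.arange(0, 10, 2)
j = np.arange(0, 10, 2)
I, J = np.meshgrid(i, j, indexing='ij')
a = I**2 + J**2 + I*J   # 5x5 array; a[m, n] corresponds to (i[m], j[n])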
So, I'm working in Python and I have a large 2D numpy array of data, and I want to take m random rows of this large data matrix. I've looked into random.sample, numpy.random.shuffle and numpy.random.permutation; all of these work, but usually they return the whole permutation or at least generate the entire range(n). If I have a very large dataset, then doing something like
data = numpy.random.uniform(size=(n, 100))
myvec = data[random.sample(range(n), m), :]
will allocate a list range(n), which blows up pretty fast. So I thought I could use xrange, which returns a generator, but hey, you can't just get an arbitrary element from a generator; that's not the way they work.
I tried it out, and it works.
data = numpy.random.uniform(size=(n, 100))
myvec = data[random.sample(xrange(n), m), :]
Any idea how?
UPDATE:
I can use
samp = random.sample(range(n),10)
for n up to 100000000 before I get a memory error. If I use
samp = random.sample(xrange(n),10)
on the other hand, I only start getting errors because of int conversion to C (the int gets too long to be converted to C) at around 1000000000. Sure, it's only a factor of 10, but I'm curious. The xrange variant is also much faster.
from random import randrange

def sample(n, m):
    d = set()
    while len(d) < m:
        d.add(randrange(n))
    return d
>>> sample(100000000000000000000000000000000000, 10)
set([5577049102993258248888250482046894L, 86044086231860190654588187118815513L, 2021737354726858669049814270580972L, 6253501639432326715043836478191628L, 5306460388221333758367322518700483L, 62195356583363524099133566314034473L, 376650426515181012918370326724858L, 80588135672357701239461833469588557L, 1978959860575617450893346333245569L, 41904683348442252013350548717573039L])
Note that a simple {randrange(n) for _ in range(m)} will do the job with very high probability when n is much larger than m.
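A small usage sketch tying this back to the row-selection problem in the question; the sizes are made up for illustration and sample() refers to the function defined above:
import numpy as np

n_rows, m = 1000, 10
data = np.random.uniform(size=(n_rows, 100))
rows = sorted(sample(n_rows, m))   # m distinct row indices, no range(n_rows) list
myvec = data[rows, :]              # the m randomly chosen rows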
So it turns out xrange objects support len() and indexing, which is exactly what random.sample() uses. So that's how it works.
a = xrange(10)
print a[5] #this works.
Elazar's solution works just as well though.
My question is about a specific array operation that I want to express using numpy.
I have an array of floats w and an array of indices idx of the same length as w, and I want to sum up all entries of w that share the same idx value and collect the sums in an array v.
As a loop, this looks like this:
for i, x in enumerate(w):
    v[idx[i]] += x
Is there a way to do this with array operations?
My guess was v[idx] += w but that does not work, since idx contains the same index multiple times.
Thanks!
numpy.bincount was introduced for this purpose:
tmp = np.bincount(idx, w)
v[:len(tmp)] += tmp
I think as of 1.6 you can also pass a minlength to bincount.
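A tiny worked check of the bincount approach, with made-up values:
import numpy as np

w   = np.array([0.5, 1.0, 2.0, 0.25])
idx = np.array([0, 2, 2, 1])
v   = np.zeros(5)

tmp = np.bincount(idx, weights=w)   # [0.5, 0.25, 3.0]
v[:len(tmp)] += tmp                 # v is now [0.5, 0.25, 3.0, 0.0, 0.0]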
This is a known behavior and, though somewhat unfortunate, does not have a numpy-level workaround. (bincount can be used for this if you twist its arm.) Doing the loop yourself is really your best bet.
Note that your code might have been a bit clearer without re-using the name w and without introducing another set of indices, like
for i, w_thing in zip(idx, w):
    v[i] += w_thing
If you need to speed up this loop, you might have to drop down to C. Cython makes this relatively easy.