I'm working in Python with a large 2D numpy array of data, and I want to take m random rows of this data matrix. I've looked into random.sample, numpy.random.shuffle and numpy.random.permutation; all of these work, but they usually return the whole permutation, or at least generate the entire range(n). With a very large dataset, doing something like
data = numpy.random.uniform(size=(n, 100))
myvec = data[random.sample(range(n), m), :]
will allocate a list for range(n), which blows up pretty fast. So I thought I could use xrange, which returns a generator, but hey, you can't just get an arbitrary element out of a generator; that's not the way they work.
I tried it out, and it works.
data = numpy.random.uniform(size=(n, 100))
myvec = data[random.sample(xrange(n), m), :]
Any idea how that works?
UPDATE:
I can use
samp = random.sample(range(n),10)
for n up to 100000000 before I get a memory error. If I use
samp = random.sample(xrange(n),10)
on the other hand, I only start getting errors because of the int conversion to C (the int gets too big to be converted to a C long), at around 1000000000. Sure, it's only a factor of 10, but I'm curious. The xrange variant is also much faster.
from random import randrange

def sample(n, m):
    d = set()
    while len(d) < m:
        d.add(randrange(n))
    return d
>>> sample(100000000000000000000000000000000000, 10)
set([5577049102993258248888250482046894L, 86044086231860190654588187118815513L, 2021737354726858669049814270580972L, 6253501639432326715043836478191628L, 5306460388221333758367322518700483L, 62195356583363524099133566314034473L, 376650426515181012918370326724858L, 80588135672357701239461833469588557L, 1978959860575617450893346333245569L, 41904683348442252013350548717573039L])
Note that a simple {randrange(n) for _ in range(m)} will do the job (produce m distinct indices) with very high probability, as long as n is much larger than m.
So it turns out that xrange objects support indexing, which is exactly what random.sample() uses. So that's how it works.
a = xrange(10)
print a[5] #this works.
Elazar's solution works just as well though.
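For what it's worth, in Python 3 the same pattern is memory-friendly out of the box, since range is lazy the way xrange was. A minimal sketch (small made-up sizes, just for illustration):

import random
import numpy as np

n, m = 10**6, 5
data = np.random.uniform(size=(n, 10))   # smaller than the real dataset, just for illustration
rows = random.sample(range(n), m)        # range(n) is never materialised as a list in Python 3
myvec = data[rows, :]
print(myvec.shape)                       # (5, 10)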
This is more of a Matlab programming question than it is a math question.
I'd like to run gradient descent multiple times with different learning rates. I have a set of learning rates
alpha = [0.3, 0.1, 0.03, 0.01, 0.003, 0.001];
and each time I run gradient descent, I get a vector J_vals as output. However, I don't know Matlab well enough to know how to implement this besides doing something like:
[theta, J_vals] = gradientDescent(...., alpha(1),...);
J1 = J_vals;
[theta, J_vals] = gradientDescent(...., alpha(2),...);
J2 = J_vals;
and so on.
I thought about using a for loop, but then I don't know how I would handle all the J_vals vectors (not sure how to collect them into J1, J2, and so on inside the loop). Perhaps it would look something like this:
for i = 1:length(alpha)
    [theta, J_vals] = gradientDescent(..., alpha(i), ...);
    J(i) = J_vals;
end
Then I would have a vector of vectors.
In Python, I would just run a for loop and append each new result to the end of a list. How do I implement something like this in Matlab? Or is there a more efficient way?
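For reference, the Python idiom I have in mind looks roughly like this sketch (run_gradient_descent is just a hypothetical stand-in for the real routine):

def run_gradient_descent(alpha, n_iters=50):
    # Hypothetical stand-in: returns dummy parameters and a dummy cost history.
    theta = [0.0, 0.0]
    J_vals = [alpha * k for k in range(n_iters)]
    return theta, J_vals

alphas = [0.3, 0.1, 0.03, 0.01, 0.003, 0.001]
all_J_vals = []
for alpha in alphas:
    theta, J_vals = run_gradient_descent(alpha)
    all_J_vals.append(J_vals)   # just append each run's cost history to a list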
If you know how many loops you are going to have and the size of J_vals (or at least a reasonable upper bound), I would suggest pre-allocating the container array to its full size,
J = zeros(n*length(alpha), 1);
and then inserting the new values into the right block on each loop,
J(start:start+n-1) = J_vals;   % with start = (i-1)*n + 1
That way you don't reallocate memory. If you don't know the sizes in advance, you can append the values to the array. For example,
J = [];  % initialize
for i = 1:length(alpha)
    [theta, J_vals] = gradientDescent(..., alpha(i), ...);
    J = [J; J_vals];  % append the new values as extra rows
end
but this re-allocates the array on every loop iteration. If it's not too many loops, it should be OK.
Matlab's "cell arrays" are kind of like lists in Python. They are similar in that you can put variable datatypes into them. Nobody seems to be too sure, but most likely the cell array is implemented as an array of object pointers. That means that it is still somewhat expensive to append to it (cell_array{length(cell_array) + 1} = new_data), but at least you are only appending a pointer instead of the entire column. You would still have to convert the cell array to a normal matrix afterward using cell2mat.
The most idiomatic Matlab solution is to pre-allocate (as #dpmcmlxxvi suggested).
I think what you are describing is a really common use case, and it's unfortunate that Matlab requires such a verbose idiom for this. Also it's frustrating that the documentation is opaque on how cell arrays are implemented and whether it is expensive to append to a cell array.
Your solution works just fine as long as you add a : for the row subscript (assuming J_vals is a column vector):
for i = 1:length(alpha)
    [theta, J_vals] = gradientDescent(..., alpha(i), ...);
    J(:, i) = J_vals;
    %// ^... all rows, column 'i'
end
You could even put that as the return value:
for i = 1:length(alpha)
    [theta, J(:, i)] = gradientDescent(..., alpha(i), ...);
    %// ^... add the returned value directly to our matrix
end
Both of these methods allow you to preallocate your matrix for a potential speed gain.
If you want to build your list as you go, you can use the method in #dpmcmlxxvi's answer, or you can use the special subscript end. Neither of these methods are compatible with preallocation, though.
for i = 1:length(alpha)
    [theta, J(:, end+1)] = gradientDescent(..., alpha(i), ...);
    %// ^... add the new vector after the current end of the list
end
I would also suggest that you not use i as a variable name in Matlab. I know it's natural in other languages, but in Matlab it shadows the built-in imaginary unit i.
See: https://stackoverflow.com/a/14790765/1377097
I was unable to find anything describing how to do this, which leads me to believe I'm not doing this in the proper idiomatic Python way. Advice on the 'proper' Python way to do this would also be appreciated.
I have a bunch of variables for a datalogger I'm writing (arbitrary logging length, with a known maximum length). In MATLAB, I would initialize them all as 1-D arrays of zeros of length n, n bigger than the number of entries I would ever see, assign each individual element variable(measurement_no) = data_point in the logging loop, and trim off the extraneous zeros when the measurement was over. The initialization would look like this:
[dData gData cTotalEnergy cResFinal etc] = deal(zeros(n,1));
Is there a way to do this in Python/NumPy so I don't either have to put each variable on its own line:
dData = np.zeros(n)
gData = np.zeros(n)
etc.
I would also prefer not to just make one big matrix, because keeping track of which column is which variable is unpleasant. Perhaps the solution is to make a (length x numvars) matrix and assign the column slices out to individual variables?
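To make that column-slice idea concrete, something like this sketch is what I mean (the variable names are just placeholders):

import numpy as np

n = 1000
log = np.zeros((n, 3))                                         # one (length x numvars) matrix
dData, gData, cTotalEnergy = log[:, 0], log[:, 1], log[:, 2]   # column views, not copies
dData[0] = 1.0
print(log[0, 0])                                               # 1.0 -- writes go through to the matrix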
EDIT: Assume I'm going to have a lot of vectors of the same length by the time this is over; e.g., my post-processing takes each log file, calculates a bunch of separate metrics (>50), stores them, and repeats until the logs are all processed. Then I generate histograms, means/maxes/sigmas/etc. for all the various metrics I computed. Since initializing 50+ vectors is clearly not easy in Python, what's the best (cleanest code and decent performance) way of doing this?
If you're really motivated to do this in a one-liner you could create an (n_vars, ...) array of zeros, then unpack it along the first dimension:
a, b, c = np.zeros((3, 5))
print(a is b)
# False
Another option is to use a list comprehension or a generator expression:
a, b, c = [np.zeros(5) for _ in range(3)] # list comprehension
d, e, f = (np.zeros(5) for _ in range(3)) # generator expression
print(a is b, d is e)
# False False
Be careful, though! You might think that using the * operator on a list or tuple containing your call to np.zeros() would achieve the same thing, but it doesn't:
h, i, j = (np.zeros(5),) * 3
print(h is i)
# True
This is because the expression inside the tuple gets evaluated first. np.zeros(5) therefore only gets called once, and each element in the repeated tuple ends up being a reference to the same array. This is the same reason why you can't just use a = b = c = np.zeros(5).
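The aliasing is easy to confirm by mutating through one of the names:

import numpy as np

h, i, j = (np.zeros(5),) * 3
h[0] = 42
print(i[0], j[0])  # 42.0 42.0 -- all three names refer to the same array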
Unless you really need to assign a large number of empty array variables and you really care deeply about making your code compact (!), I would recommend initialising them on separate lines for readability.
Nothing wrong or un-Pythonic with
dData = np.zeros(n)
gData = np.zeros(n)
etc.
You could put them on one line, but there's no particular reason to do so.
dData, gData = np.zeros(n), np.zeros(n)
Don't try dData = gData = np.zeros(n), because a change to dData changes gData (they point to the same object). For the same reason you usually don't want to use x = y = [].
The deal in MATLAB is a convenience, but isn't magical. Here's how Octave implements it
function [varargout] = deal (varargin)
  if (nargin == 0)
    print_usage ();
  elseif (nargin == 1 || nargin == nargout)
    varargout(1:nargout) = varargin;
  else
    error ("deal: nargin > 1 and nargin != nargout");
  endif
endfunction
In contrast to Python, in Octave
one = two = three = zeros(1,3)
assigns independent copies to the 3 variables (MATLAB itself doesn't accept chained assignment at all).
Notice also how MATLAB talks about deal as a way of assigning contents of cells and structure arrays. http://www.mathworks.com/company/newsletters/articles/whats-the-big-deal.html
If you put your data in a collections.defaultdict you won't need to do any explicit initialization. Everything will be initialized the first time it is used.
import numpy as np
import collections
n = 100
data = collections.defaultdict(lambda: np.zeros(n))
for i in range(1, n):
data['g'][i] = data['d'][i - 1]
# ...
How about using map:
import numpy as np
n = 10 # Number of data points per array
m = 3 # Number of arrays being initialised
gData, pData, qData = map(np.zeros, [n] * m)
I have the following nested loop, but it is inefficient time-wise, so I figured using a generator would be much better. Do you know how to do that?
x_sph[:] = [r*sin_t*cos_p for cos_p in cos_phi for sin_t in sin_theta for r in p]
It seems like some of you are of the opinion (judging from the comments) that using a generator is not helpful in this case. I am under the impression that using generators avoids storing all the values in memory, and thus saves memory and time. Am I wrong?
Judging from your code snippet you want to do something numerical and you want to do it fast. A generator won't help much in this respect. But using the numpy module will. Do it like so:
import numpy

# Change your p into an array, you'll see why.
r = numpy.array(p)                  # if p is a list, this turns it into a 1-D array
sin_theta = numpy.array(sin_theta)  # same with the rest
cos_phi = numpy.array(cos_phi)

# Broadcasting builds all the r*sin_t*cos_p combinations at once, flattened in the
# same order as your comprehension (cos_p outermost, r innermost).
x_sph = (cos_phi[:, None, None] * sin_theta[None, :, None] * r[None, None, :]).ravel()
In fact I'd use numpy even earlier, by doing:
phi = numpy.array(phi) # I don't know how you calculate this but you can start here with a phi list.
theta = numpy.array(theta)
sin_theta = numpy.sin(theta)
cos_phi = numpy.cos(phi)
You could even skip the intermediate sin_theta and cos_phi assignments and just put all the stuff in one line. It'll be long and complicated so I'll omit it but I do numpy-maths like that sometimes.
And numpy is fast, it'll make a huge difference. At least a noticeable one.
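As a quick sanity check of the vectorized version against the original comprehension, here is a self-contained sketch with made-up sizes (names match the question):

import numpy as np

rng = np.random.default_rng(0)
p = rng.random(4)
theta = rng.random(5)
phi = rng.random(6)

sin_theta = np.sin(theta)
cos_phi = np.cos(phi)

# Broadcasting gives shape (6, 5, 4); ravel() flattens with r varying fastest and
# cos_p slowest, matching the order of the comprehension in the question.
x_vec = (cos_phi[:, None, None] * sin_theta[None, :, None] * p[None, None, :]).ravel()
x_loop = np.array([r * sin_t * cos_p for cos_p in cos_phi for sin_t in sin_theta for r in p])
print(np.allclose(x_vec, x_loop))  # True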
[...] creates a list and (...) creates a generator:
generator = (r*sin_t*cos_p for cos_p in cos_phi for sin_t in sin_theta for r in p)
for value in generator:
    # Do something
To turn a loop into a generator, you can make it a function and yield:
def x_sph(p, cos_phi, sin_theta):
    # Same loop order as the comprehension: cos_p outermost, r innermost.
    for cos_p in cos_phi:
        for sin_t in sin_theta:
            for r in p:
                yield r * sin_t * cos_p
However, note that the advantages of generators are generally only realised if you don't need to calculate all values and can break at some point, or if you don't want to store all the values (the latter is a space rather than time advantage). If you end up calling this:
lst = list(x_sph(p, cos_phi, sin_theta))
then you won't see any gain.
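To illustrate the space point (this is about memory, not speed), here is a rough sketch comparing the sizes of the container objects themselves:

import sys

squares_list = [k * k for k in range(10**6)]  # stores all million results up front
squares_gen = (k * k for k in range(10**6))   # produces them one at a time on demand

print(sys.getsizeof(squares_list))  # several megabytes (just the list's pointer array)
print(sys.getsizeof(squares_gen))   # a tiny constant, roughly 100-200 bytes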
I have an array A and a list of slicing indices (s, t); let's call this list L.
I want to find the 85th percentile of A[s1:t1], A[s2:t2], ...
Is there a way to vectorize these operations in numpy?
ans = []
for (s, t) in L:
    ans.append(numpy.percentile(A[s:t], 85))
looks cumbersome.
Thanks a lot!
PS: it's safe to assume s1 < s2 .... t1 < t2 ..... This is really just a sliding window percentile problem.
Given that you're dealing with a non-uniform interval (i.e. the slices aren't the same size), no, there's no way to have numpy do it in a single function call.
If it was a uniform slice size, then you could do so with various tricks, as #eat commented.
However, what's wrong with a list comprehension? It's exactly equivalent to your loop above, but it looks "cleaner" if that's what you're worried about.
ans = [numpy.percentile(A[s:t], 85) for s,t in L]
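For a concrete usage sketch with made-up data (assuming L holds (start, stop) pairs as in the question):

import numpy as np

A = np.arange(100, dtype=float)
L = [(0, 10), (5, 25), (40, 90)]
ans = [np.percentile(A[s:t], 85) for s, t in L]
print(ans)  # one 85th-percentile value per slice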
Another (possibly stupid) question from my side ;) I have some issues with the following snippet, where len(x) = len(y) = 7,700,000:
from numpy import *

for k in range(len(x)):
    if x[k] == xmax:
        xind = -1
    else:
        xind = int(floor((x[k]-xmin)/xdelta))
    if y[k] == ymax:
        yind = -1
    else:
        yind = int(floor((y[k]-ymin)/ydelta))
    arr = append(arr, grid[xind, yind])
All variables are floats or integers except arr and grid. arr is a 1D-array and grid is a 2D-array.
My problem is that it takes a long time to run through the loop (several minutes). Can anyone explain to me why this takes so long? Does anyone have a suggestion? Even when I swap range() for arange(), I only save a few seconds.
Thanks.
1st EDIT
Sorry, I forgot to mention that I'm importing numpy.
2nd EDIT
I have some points in a 2D grid. Each cell of the grid has a value stored. I have to find out which cell each point falls into and copy that cell's value into a new array. That's my problem and my idea.
P.S.: Look at the picture if you want to understand it better; the cells' values are represented with different colors.
How about something like:
import numpy as np
# Vectorized over the whole arrays -- no Python-level loop.
xind = np.floor((x - xmin) / xdelta).astype(int)
yind = np.floor((y - ymin) / ydelta).astype(int)
# Match the special case in your loop: entries equal to the max get index -1.
xind[x == xmax] = -1
yind[y == ymax] = -1
arr = grid[xind, yind]
Note: if you're using numpy don't treat the arrays like python lists if you want to do things efficiently.
for x_item, y_item in zip(x, y):
    # do stuff.
There's also itertools.izip if you don't want to generate a giant extra list.
I cannot see an obvious problem besides the size of the data. Is your computer able to hold everything in memory? If not, you are probably "jumping around" in swapped memory, which will always be slow. If the complete data fits in memory, give psyco a try. It might speed up your calculation a lot.
I suspect the problem might be in the way you're storing the results:
arr = append(arr,grid[xind,yind])
The docs for append say it returns:
A copy of arr with values appended to axis. Note that append does not occur in-place: a new array is allocated and filled.
This means you'll be deallocating and allocating a larger and larger array every iteration. I suggest allocating an array of the correct size up-front, then populating it with data in each iteration. e.g.:
arr = empty(len(x))
for k in range(len(x)):
    ...
    arr[k] = grid[xind, yind]
x's length is 7 million? I think that's why!
The loop iterates 7 million times;
you should probably use a different kind of loop.
Is it really necessary to loop 7 million times?