Numpy array.resize() - zeros 'first' - python

I can use array.resize(shape) to resize my array and have zeros added to the indices that have no value. If my array is [1,2,3,4] and I use array.resize((5,)), I get [1,2,3,4,0]. How can I append / pad the zeros at the front instead, yielding [0,1,2,3,4]?
I am doing this dynamically - trying to use:
array.resize(arrayb.shape)
I would like to avoid (at all costs) making an in-memory copy of the array; that rules out, for example, reversing the array, resizing, and reversing again. Working with a view would be ideal.

You could try working on an array with negative strides (though note that resize itself may still have to reallocate and copy):
import numpy as np

_a = np.empty(0)  # the original array, which owns the memory
a = _a[::-1]      # the reversed view you actually work with...

# now, instead of resizing a, resize the original _a:
del a             # you need to delete the view first; otherwise resize() demands
                  # refcheck=False, which is dangerous!
_a.resize(5)
# and update a to point at the resized array:
a = _a[::-1]
But I would really suggest that you make the array large enough up front if at all possible. This does not seem very elegant, and I think it is the only way short of copying data around. Your array will also have a negative stride, so it won't be contiguous; if that means some function you use on it must make a copy, you are out of luck.
Also, if you slice a or _a, you have to either make a copy or make sure you delete the slice before resizing. You can pass refcheck=False instead, but that appears to invalidate the data.
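For concreteness, here is a minimal end-to-end sketch of that trick (my own illustration, not part of the original answer), assuming the data is stored reversed in the owning array and that this runs as a plain script so no stray references trip the refcount check:
import numpy as np

_a = np.array([4, 3, 2, 1])   # owns the memory; holds the data back-to-front
a = _a[::-1]                  # the array you actually work with: [1, 2, 3, 4]

del a                         # drop the view so _a.resize() passes its refcount check
_a.resize(5)                  # grow _a in place (if possible): [4, 3, 2, 1, 0]
a = _a[::-1]                  # re-create the view: [0, 1, 2, 3, 4] -- zero at the front
print(a)                      # [0 1 2 3 4]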

I believe you can use slice assignment to do this. I see no reason why numpy would need to make a copy for an operation like this, as long as it does the necessary checks for overlaps (though of course as others have noted, resize may itself have to allocate a new block of memory). I tested this method with a very large array, and I saw no jump in memory usage.
>>> import numpy
>>> a = numpy.arange(10)
>>> a.resize(15)
>>> a[5:] = a[:10]
>>> a[0:5] = 0
>>> a
array([0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
The following showed no jump in memory usage for the assignment operation:
>>> a = numpy.arange(100000000)
>>> a.resize(150000000)
>>> a[50000000:] = a[:100000000]
I don't know of a better way, and this is just a conjecture. Let me know if it doesn't work.
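Applied to the [1,2,3,4] example from the question, the same recipe would look like this (a sketch that, as noted above, relies on numpy buffering the overlapping assignment):
>>> import numpy
>>> a = numpy.array([1, 2, 3, 4])
>>> a.resize(5)      # now [1, 2, 3, 4, 0]
>>> a[1:] = a[:4]    # shift the data right by one
>>> a[:1] = 0        # zero out the front
>>> a
array([0, 1, 2, 3, 4])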

randomly choose different sets in numpy?

I am trying to randomly select a set of integers in numpy and am encountering a strange error. If I define a numpy array with two sets of different sizes, np.random.choice chooses between them without issue:
Set1 = np.array([[1, 2, 3], [2, 4]])
In: np.random.choice(Set1)
Out: [2, 4]
However, once the sublists in the numpy array are the same size, I get a ValueError:
Set2 = np.array([[1, 3, 5], [2, 4, 6]])
In: np.random.choice(Set2)
ValueError: a must be 1-dimensional
Could be user error, but I've checked several times and the only difference is the size of the sets. I realize I can do something like:
Chosen = np.random.choice(N, k)
Selection = Set[Chosen]
where N is the number of sets and k is the number of samples. But I'm just wondering whether there is a better way, and specifically what I am doing wrong that raises a ValueError when the sets are the same size.
Printout of Set1 and Set2 for reference:
In: Set1
Out: array([list([1, 2, 3]), list([2, 4])], dtype=object)
In: type(Set1)
Out: numpy.ndarray
In: Set2
Out:
array([[1, 3, 5],
       [2, 4, 6]])
In: type(Set2)
Out: numpy.ndarray
Your issue is caused by a misunderstanding of how numpy arrays work. The first example cannot "really" be turned into an array, because numpy does not support ragged arrays: you end up with a 1-D array of object references that point to two Python lists. The second example is a proper 2xN numerical array. I can think of two types of solutions here.
The obvious approach (which would work in both cases, by the way), would be to choose the index instead of the sublist. Since you are sampling with replacement, you can just generate the index and use it directly:
Set[np.random.randint(N, size=k)]
This is the same as
Set[np.random.choice(N, k)]
If you want to choose without replacement, your best bet is to use np.random.choice, with replace=False. This is similar to, but less efficient than shuffling. In either case, you can write a one-liner for the index:
Set[np.random.choice(N, k, replace=False)]
Or:
index = np.arange(Set.shape[0])
np.random.shuffle(index)
Set[index[:k]]
The nice thing about np.random.shuffle, though, is that you can apply it to Set directly, whether it is a one- or many-dimensional array. Shuffling will always happen along the first axis, so you can just take the top k elements afterwards:
np.random.shuffle(Set)
Set[:k]
Shuffling works only in place, so you have to write it out the long way rather than as a one-liner. It's also less efficient for large arrays, since you have to set up and shuffle the entire range up front, no matter how small k is.
The other solution is to turn the second example into an array of list objects like the first one. I do not recommend this unless the only reason you are using numpy is the choice function. In fact, I wouldn't recommend it at all, since you can, and probably should, use Python's standard random module at that point. Disclaimers aside, you can coerce the datatype of the second array to object. It removes any benefit of using numpy, and it can't be done directly: simply setting dtype=object will still create a 2-D array, just storing references to Python int objects instead of primitives. You have to do something like this:
Set = np.zeros(N, dtype=object)
Set[:] = [[1, 2, 3], [2, 4]]
You will now get an object essentially equivalent to the one in the first example, and can therefore apply np.random.choice directly.
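A quick illustrative check (my own, using the same data as above) that np.random.choice now works on the object array:
import numpy as np

Set = np.zeros(2, dtype=object)
Set[:] = [[1, 2, 3], [2, 4]]

print(Set)                    # [list([1, 2, 3]) list([2, 4])]
print(np.random.choice(Set))  # one of the two lists, e.g. [2, 4]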
Note
I show the legacy np.random methods here out of personal inertia if nothing else. The correct way, as the numpy documentation suggests, is to use the new Generator API. This is especially true for the choice method, which is much more efficient in the new implementation. The usage is not any more difficult:
Set[np.random.default_rng().choice(N, k, replace=False)]
There are additional advantages, like the fact that you can now choose directly, even from a multidimensional array:
np.random.default_rng().choice(Set2, k, replace=False)
The same goes for shuffle, which, like choice, now allows you to select the axis you want to rearrange:
np.random.default_rng().shuffle(Set)
Set[:k]
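Putting it together, here is a small runnable sketch of the Generator-based versions (the seed and the value of k are my own choices, for reproducibility):
import numpy as np

rng = np.random.default_rng(seed=0)
Set2 = np.array([[1, 3, 5], [2, 4, 6]])
k = 1

# choose k rows directly from the 2-D array, without replacement
print(rng.choice(Set2, k, replace=False))

# or shuffle the rows in place along the first axis and take the first k
rng.shuffle(Set2, axis=0)
print(Set2[:k])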

Truncating a large array (A = A[:N]) — does the original A ever get cleared? Can I "get rid of" A[:N].base and "transfer ownership"?

Suppose that I have a large array:
A = numpy.arange(100000000)
and now I truncate it:
A = A[:10]
I used to think that, given that I don't have a name bound to the original A any more, its reference count has dropped to zero and it will get garbage-collected. However, A.base surreptitiously still refers to the original array! Does that mean that the only way to clear this up is by making an explicit copy, i.e.
A = A[:10].copy()
or is there some other way to, so to say, transfer primary ownership of the memory used to the new object, while the original can be garbage collected? I'm worried that this may be the source of subtle memory leaks in parts of my code.
(remotely related question: Memory-efficient way to truncate large array in Matlab)
When you do this:
A = A[:10]
You are getting a view on the original A (because this is basic slice indexing), not creating a new array. So indeed, the original A is not freed, because the view still refers to its memory.
The proper way is indeed to create a copy, either with:
A = A[:10].copy()
Or:
A = np.array(A[:10])
Per documentation:
All arrays generated by basic slicing are always views of the original array.
So, imagine your A is the Mona Lisa. You set up a frame in front of it so that it only shows Mona Lisa's head (when looked at from the correct angle). If someone were to remove the Mona Lisa, the "painting" of her head inside the frame would also disappear. You would need to copy what you see in the small frame onto a new canvas to have a copy that is safe against removal of the original.
You can verify this:
A = numpy.arange(100000000)
B = A[:10]
B[0] = 17
A[:5]
# => array([17, 1, 2, 3, 4])
So you absolutely do need to copy in order to dissociate your new array from the original one. You can create a copy in a variety of ways: explicitly with copy, or with the array constructor. You could also use advanced (fancy) indexing, which doesn't return a view:
B = A[range(10)]
B[1] = 34
A[:5]
# => array([17, 1, 2, 3, 4])
B[:5]
# => array([17, 34, 2, 3, 4])
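A quick way to see the difference is to check the .base attribute discussed above (my own illustration):
import numpy as np

A = np.arange(100000000)
B = A[:10]             # a view: keeps the whole 100M-element buffer alive
C = A[:10].copy()      # owns its own tiny buffer

print(B.base is A)     # True  -- B still references A's memory
print(C.base is None)  # True  -- C is independent, so the big array can be freed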

numpy: indexes too big sometimes give exceptions, sometimes not

It may seem really stupid, but I'm wondering why the following code (numpy 1.11.2) raises an exception:
import numpy as npy
a = npy.arange(0,10)
a[10]
And not this one:
import numpy as npy
a = npy.arange(0,10)
a[1:100]
I can understand that, when we want to take part of an array, we may not really care if the index becomes too big (we just take whatever is in the array), but it seems a bit tricky to me: it's quite easy not to notice that you have a bug in the way you're counting indexes, since no exception is raised.
This is consistent with how Python lists (or sequences in general) behave:
>>> L = list(range(10))
>>> L[10]
Traceback (most recent call last):
  ...
IndexError: list index out of range
>>> L[1:100]
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> L[100:100]
[]
You cannot access an index that does not exist.
But you can have an empty range, i.e. an empty list or an empty NumPy array.
So if one of the slice bounds lies outside the size of the sequence, you simply get whatever is there.
The Python tutorial uses a more positive wording:
However, out of range slice indexes are handled gracefully when used for slicing:
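For example, the NumPy array from the question behaves the same way (a quick illustration of my own, mirroring the list example above):
>>> import numpy as npy
>>> a = npy.arange(0, 10)
>>> a[1:100]
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> a[10]
Traceback (most recent call last):
  ...
IndexError: index 10 is out of bounds for axis 0 with size 10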
When you give the index 1:100, you use slicing. Python, in general, accepts slices larger than the list and ignores the missing items, so there is no problem. However, when you write x[10], you refer specifically to the 11th element (remember that indexing starts at 0), which does not exist, so you get an exception.
In Python, counting begins at 0.
In your first example your array has 10 elements, but they are indexed from 0 to 9. Therefore, calling a[10], you attempt to access the 11th element, which gives you an error because it is outside the valid index range for your array.
As follows:
A = np.arange(0, 10)   # A is [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
len(A)                 # 10
A[9]                   # 9 -- the last valid index
You can read about Python 0 indexing here:
https://docs.scipy.org/doc/numpy-1.10.0/user/basics.indexing.html

Perform a reverse cumulative sum on a numpy array

Can anyone recommend a way to do a reverse cumulative sum on a numpy array?
Where 'reverse cumulative sum' is defined as below (I welcome any corrections on the name for this procedure):
if
x = np.array([0,1,2,3,4])
then
np.cumsum(x)
gives
array([0,1,3,6,10])
However, I would like to get
array([10,10,9,7,4])
Can anyone suggest a way to do this?
This does it:
np.cumsum(x[::-1])[::-1]
You can use np.flipud() for this as well, which is equivalent to [::-1]:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.flipud.html
In [0]: x = np.array([0,1,2,3,4])
In [1]: np.flipud(np.flipud(x).cumsum())
Out[1]: array([10, 10,  9,  7,  4])
np.flip() is new as of NumPy 1.12 and combines np.flipud() and np.fliplr() into one API.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.flip.html
This is equivalent, and has fewer function calls:
np.flip(np.flip(x, 0).cumsum(), 0)
The answers given so far are all inefficient if you want the result stored in the original array. And if you want a copy, keep in mind that the double flip returns a non-contiguous view (of the intermediate cumsum result), so np.ascontiguousarray() may still be needed.
How about
view = np.flip(x, 0)
np.cumsum(view, 0, out=view)
# x now contains the reverse cumsum result and remains contiguous and unflipped
This modifies the flipped view of x, which writes the data, in reverse order, back into the original x. No non-contiguous view is left over at the end of execution, and it is about as fast as possible. I am guessing numpy will never add a dedicated reverse-cumsum method, precisely because the technique described here is so trivially and efficiently available (although an explicit method might be ever so slightly faster).
Otherwise, if a copy is desired, the extra flip is required, plus a conversion back to a contiguous array, especially if the result will be used in many vector operations afterwards. This is a tricky corner of numpy, but views and contiguity are things to be careful with if you are seriously interested in performance.
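As a quick sanity check of the in-place version (my own illustration, not part of the original answer):
import numpy as np

x = np.array([0, 1, 2, 3, 4])
view = np.flip(x, 0)
np.cumsum(view, 0, out=view)

print(x)                        # [10 10  9  7  4]
print(x.flags['C_CONTIGUOUS'])  # True -- x itself was never reordered in memory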

Assume zero for subsequent dimensions when slicing an array

I have need to slice an array where I would like zero to be assumed for every dimension except the first.
Given an array:
x = numpy.zeros((3,3,3))
I would like the following behavior, but without needing to know the number of dimensions before hand:
y = x[:,0,0]
Essentially I am looking for something that would take the place of Ellipsis, but instead of expanding to the needed number of : objects, it would expand into the needed number of zeros.
Is there anything built in for this? If not, what is the best way to get the functionality that I need?
Edit:
One way to do this is to use:
y = x.ravel(order='F')[0:x.shape[0]]
This works fine, however in some cases (such as mine) ravel will need to create a copy of the array instead of a view. Since I am working with large arrays, I want a more memory efficient way of doing this.
You could create an indexing tuple, like this:
import numpy as np

x = np.arange(3 * 3 * 3).reshape(3, 3, 3)
s = (slice(None),) + (0,) * (x.ndim - 1)
print(x[s])        # [ 0  9 18]
print(x[:, 0, 0])  # [ 0  9 18]  (same result)
I guess you could also do:
x.transpose().flat[:3]
but I prefer the first approach, since it works along any axis (rather than only the first), and it's obviously just as efficient as writing x[:,0,0], since it's only a different syntax.
I usually use tom10's method, but here's another:
for i in range(x.ndim - 1):
    x = x[..., 0]
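If you need this more than once, a small helper wrapping the tuple-indexing idea keeps it readable (the function name here is my own, hypothetical one):
import numpy as np

def first_axis_at_origin(x):
    """Return x[:, 0, 0, ..., 0] as a view, for any number of dimensions."""
    return x[(slice(None),) + (0,) * (x.ndim - 1)]

x = np.arange(3 * 3 * 3).reshape(3, 3, 3)
print(first_axis_at_origin(x))   # [ 0  9 18], same as x[:, 0, 0]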
