Efficient way to drop a column from a Numpy array? - python

If I have a very large numpy array with one useless column, how could I drop it without creating a copy of the original array?
np.delete(my_np_array, 0, 1)
The above code will return a copy of the array without the zero-th column. But instead I would like to simply delete that column from my_np_array since I don't need it. For very large datasets, the memory management becomes important and copying may not be an option.

If memory is the main concern, what you can do is move the data around within your array so that the unneeded elements end up at the very end of its buffer, then use ndarray.resize, which modifies the array in-place, to shrink it down and discard them.
You cannot simply remove the first column of an array in-place using the provided API, and I suspect it is because of the memory layout of an ndarray, which maps multidimensional indexing to unidimensional byte-oriented addressing within a block of contiguous memory.
The following example removes the first column in place. Since ndarray.resize truncates the flat (C-ordered) buffer, the surviving elements are first packed to the front of that buffer; the resize then immediately purges the trailing memory, so the obsolete column is removed from memory completely.
D1, D2 = A.shape
# Pack each row's columns 1..D2-1 to the front of the flat buffer.
# Reads through A.flat return copies, so the overlapping moves are safe.
for r in range(D1):
    A.flat[r * (D2 - 1):(r + 1) * (D2 - 1)] = A.flat[r * D2 + 1:(r + 1) * D2]
A.resize((D1, D2 - 1), refcheck=False)
A.shape
# => would be (5, 4) if the shape was initially (5, 5) for example

If you use slicing, numpy won't make a copy; in other words:
a = numpy.array([1, 2, 3, 4, 5])
b = a[1:] # view elements from second to last, NOT making a copy
b[0] = 12 # Change first element of `b`, i.e. second of `a`
print(a)
will print [ 1 12  3  4  5]
If you need to delete an element in the middle, however, a single slice won't work.
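For example, a minimal sketch: removing a middle element means gluing two slices together, and np.concatenate allocates a new array, so a copy is unavoidable here.
import numpy as np

a = np.array([1, 2, 3, 4, 5])
i = 2                                    # index of the element to drop
b = np.concatenate((a[:i], a[i + 1:]))   # copies both slices into a fresh array
print(b)                                 # => [1 2 4 5]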

NumPy arrays have a fixed size, so elements can't be removed without creating an intermediate copy.
How to remove specific elements in a numpy array
Creating a view with slicing, and making a copy of that, is probably the fastest you can do.
In [804]: a = np.ones((2,2))
In [805]: a
Out[805]:
array([[ 1.,  1.],
       [ 1.,  1.]])
In [806]: np.resize(a,(3,2))
Out[806]:
array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])
In [807]: a   # <- a should now be resized if it was done in-place?
Out[807]:
array([[ 1.,  1.],
       [ 1.,  1.]])
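To illustrate the "view plus copy" suggestion above, here is a minimal sketch (my example, not part of the original session): slice off the column you don't want, materialize the view with one copy, and drop the original so its memory can be freed.
import numpy as np

a = np.random.random((1000, 5))
b = a[:, 1:].copy()   # one contiguous copy of everything except column 0
del a                 # the original buffer can now be reclaimed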

Related

Copying a list of numpy arrays

I am working with (lists of) lists of numpy arrays. As a bare bones example, consider this piece of code:
a = [np.zeros(5)]
b = a.copy()
b[0] += 1
Here, I copy a list of one array from a to b. However, the array itself is not copied, so:
print(a)
print(b)
both give [array([1., 1., 1., 1., 1.])]. If I want to make a copy of the array as well, I could do something like:
b = [arr.copy() for arr in a]
and a would remain unchanged. This works well for a simple list, but it becomes more complicated when working with nested lists of arrays where the number of arrays in each list is not always the same.
Is there a simple way to copy a multi-level list and every object that it contains without keeping references to the objects in the original list? Basically, I would like to avoid nested loops as well as dealing with the size of each individual sub-list.
What you are looking for is a deepcopy
import numpy as np
import copy
a = [np.zeros(5)]
b = copy.deepcopy(a)
b[0] += 1 # a[0] is not changed
This is actually the method recommended in the NumPy docs for deep copies of object arrays.
You need to use deepcopy.
import numpy as np
import copy
a = [np.zeros(5)]
b = copy.deepcopy(a)
b[0] += 1
print(a)
print(b)
Result:
[array([0., 0., 0., 0., 0.])]
[array([1., 1., 1., 1., 1.])]
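To connect this back to the nested, unevenly sized lists from the question: copy.deepcopy recurses through every level on its own, so no loops over sublists are needed. A small sketch:
import numpy as np
import copy

nested = [[np.zeros(2)], [np.zeros(3), np.zeros(1)]]  # ragged nested lists
nested_copy = copy.deepcopy(nested)
nested_copy[1][0] += 1
print(nested[1][0])   # => [0. 0. 0.], the original is untouched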

Struggling with numpy libs where()

I somehow got mixed up with primitive AI and came across this code, which I am having a hard time understanding.
I read some sites, but none seem to have the answer I am looking for. :(
Could anyone explain np.where() function in this scenario?
It occurred to me that this line of code makes child_pos an empty 2d array when
if curr_node.get_curr_child() == 0
But I am not sure... Glad for every response.
The code in question is:
child_pos = np.where(np.asarray(curr_node.get_curr_child()) == 0)[0][0]
Disregarding the rest of your code: when called with just a condition, np.where returns the positions of the values that satisfy it.
For example:
Let's assume
matrix = array([[1., 1., 1.],
                [1., 0., 1.],
                [1., 1., 0.]])
If we were to run np.where(matrix == 0) what we would get is
(array([1, 2], dtype=int64),
 array([1, 2], dtype=int64))
Which basically gives you the row/column positions of the value 0 in the original 2-dimensional array. The first array represents the row positions and the second array represents the column positions.
This logic extends to higher/lower dimensions as well.
Returning to your code: you turn the result of get_curr_child into a NumPy array, and then the trailing [0][0] fetches the first index from the first array that np.where returns, i.e. the position of the first 0.
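A minimal sketch of that last step, assuming get_curr_child returns a flat sequence (its real return type isn't shown in the question):
import numpy as np

children = [3, 0, 2, 0]   # hypothetical stand-in for curr_node.get_curr_child()
child_pos = np.where(np.asarray(children) == 0)[0][0]
print(child_pos)          # => 1, the index of the first 0
# note: [0][0] raises an IndexError if no 0 is present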

Deleting multiple elements at once from a numpy 2d array

Is there a way to delete from a numpy 2d array when I have the indexes? For example:
a = np.random.random((4,5))
idxs = [(0,1), (1,3), (2, 1), (3,4)]
I want to remove the indexes specified above. I tried:
np.delete(a, idxs)
but it just removes the top row.
To give an example, for the following input:
[
[0.15393912, 0.08129568, 0.34958515, 0.21266128, 0.92372852],
[0.42450441, 0.1027468 , 0.13050591, 0.60279229, 0.41168151],
[0.06330729, 0.60704682, 0.5340644 , 0.47580567, 0.42528617],
[0.27122323, 0.42713967, 0.94541073, 0.21462462, 0.07293321]
]
and with the indexes as mentioned above, I want the result to be:
[
[0.15393912, 0.34958515, 0.21266128, 0.92372852],
[0.42450441, 0.1027468 , 0.13050591, 0.41168151],
[0.06330729, 0.5340644 , 0.47580567, 0.42528617],
[0.27122323, 0.42713967, 0.94541073, 0.21462462]
]
Your indices should be flat (raveled) indices; otherwise np.delete only works for removing a whole row or column.
Here is how you can convert the (row, column) pairs and use them:
arr = np.array([
    [0.15393912, 0.08129568, 0.34958515, 0.21266128, 0.92372852],
    [0.42450441, 0.1027468 , 0.13050591, 0.60279229, 0.41168151],
    [0.06330729, 0.60704682, 0.5340644 , 0.47580567, 0.42528617],
    [0.27122323, 0.42713967, 0.94541073, 0.21462462, 0.07293321]
])
idxs = [(0,1), (1,3), (2, 1), (3,4)]
idxs = [i*arr.shape[1]+j for i, j in idxs]
np.delete(arr, idxs).reshape(4,4)
For the reshape to work, you must remove items such that an equal number of items remains in every row and column after deletion (here, exactly one item per row).
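As an aside, NumPy can do this (row, column) to flat-index conversion for you with np.ravel_multi_index; a sketch:
import numpy as np

arr = np.random.random((4, 5))
idxs = [(0, 1), (1, 3), (2, 1), (3, 4)]
flat = np.ravel_multi_index(np.array(idxs).T, arr.shape)  # => [ 1  8 11 19]
result = np.delete(arr, flat).reshape(4, 4)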
Numpy doesn't know that you are removing exactly one element per row when you give it arbitrary indices like that. Since you do know that, I would suggest using a mask to shrink the array. Masking has the same problem: it doesn't assume anything about the shape of the result (because it can't in general), and returns a raveled array. You can reinstate the shape you want quite easily, though. In fact, since you have exactly one index per row, I would suggest dropping the row part of each index entirely and passing just the column indices:
def remove_indices(a, idx):
    # idx holds one column index per row of `a`
    if len(idx) != a.shape[0]:
        raise ValueError('Wrong number of indices')
    mask = np.ones(a.shape, dtype=np.bool_)
    mask[np.arange(len(idx)), idx] = False
    return a[mask].reshape(a.shape[0], a.shape[1] - 1)
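Usage, passing only the column indices (the row parts dropped, since there is exactly one per row):
a = np.arange(20.0).reshape(4, 5)
remove_indices(a, [1, 3, 1, 4])
# array([[ 0.,  2.,  3.,  4.],
#        [ 5.,  6.,  7.,  9.],
#        [10., 12., 13., 14.],
#        [15., 16., 17., 18.]])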
Here is a method using np.where
import numpy as np
import operator as op
a = np.arange(20.0).reshape(4,5)
idxs = [(0,1), (1,3), (2, 1), (3,4)]
m,n = a.shape
# extract column indices
# there are simpler ways but this is fast
columns = np.fromiter(map(op.itemgetter(1),idxs),int,m)
# build decimated array
result = np.where(columns[...,None]>np.arange(n-1),a[...,:-1],a[...,1:])
result
# array([[ 0.,  2.,  3.,  4.],
#        [ 5.,  6.,  7.,  9.],
#        [10., 12., 13., 14.],
#        [15., 16., 17., 18.]])
As the documentation says
Return a new array with sub-arrays along an axis deleted.
np.delete deletes whole rows or columns, based on the value of the axis parameter.
Secondly, np.delete expects an int or an array of ints as its index parameter, not a list of tuples.
So you need to specify explicitly what the requirement is: whole rows/columns via axis, or positions in the flattened array.
As @Divakar suggested, look at other answers on Stack Overflow regarding deleting individual items in a numpy array.
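For completeness, a short sketch of what np.delete is designed for, deleting whole rows or columns with the axis parameter:
import numpy as np

a = np.arange(20.0).reshape(4, 5)
np.delete(a, 1, axis=1)       # returns a copy without column 1
np.delete(a, [0, 2], axis=0)  # returns a copy without rows 0 and 2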

How does numpy array typing interact with object?

I am currently trying to implement a datatype that stores floats in a numpy array. However, trying to build such an array from rows of various lengths breaks the code, since that amounts to assigning a sequence to a single array element, which is not possible.
One can bypass this by using the dtype object instead of float. Why is that? How could one resolve this problem using floats, without assigning a sequence to an element?
Example code that does not work.
from numpy import *
foo= dtype(float32, [])
x = array([[2., 3.], [3.]], dtype=foo)
Example code that does work:
from numpy import *
foo= dtype(float32, [])
x = array([[2., 3.], [3., 2.]], dtype=foo)
Example code that does work, whose behavior I try to replicate for float:
from numpy import *
foo= dtype(object, [])
x = array([[2., 3.], [3.]], dtype=foo)
The object dtype in Numpy simply creates an array of pointers to Python objects. This means you lose the performance advantage you usually get from Numpy, but it's still sometimes useful to do this.
Your last example creates a one-dimensional Numpy array of length two, so that's two pointers to Python objects. Both these objects happen to be lists, and Python lists have arbitrary dynamic length.
I don't know what you were trying to achieve with this, but note that
>>> np.dtype(np.float32, []) == np.float32
True
Arrays require the same number of elements for each row. So, if you feed a list of lists into numpy and all sublists have the same number of elements, it'll happily convert it to an array. This is why your second example works.
If the sublists are not the same length, then each sublist is treated as a single object and you end up with a 1D array of objects. This is why your third example works. Your first example doesn't work because you try to cast a sequence of objects to floats, which isn't possible.
In short, you can't create an array of floats if your sublists are of different lengths. At best, you can create an array of 1D arrays, since they are still considered objects. (Note that recent NumPy versions, 1.24 and later, require an explicit dtype=object for a ragged construction like the one below; older versions inferred it with a warning.)
>>> x = np.array(list(map(np.array, [[2., 3.], [3.]])))
>>> x
array([array([ 2., 3.]), array([ 3.])], dtype=object)
>>> x[0]
array([ 2., 3.])
>>> x[0][1]
3.0
>>> # but you can't do this
>>> x[0,1]
Traceback (most recent call last):
File "<pyshell#18>", line 1, in <module>
x[0,1]
IndexError: too many indices for array
If you're bent on creating a float 2D array, you have to extend all your sublists to the same size with None, which will be converted to np.nan.
>>> lists = [[2., 3.], [3.]]
>>> max_len = max(map(len, lists))
>>> for i, sublist in enumerate(lists):
...     sublist = sublist + [None] * (max_len - len(sublist))
...     lists[i] = sublist
...
>>> np.array(lists, dtype=np.float32)
array([[  2.,   3.],
       [  3.,  nan]], dtype=float32)
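A vectorized variant of the same padding idea (my sketch, not from the original answer): pre-fill a NaN array of the target shape and copy each sublist in.
>>> lists = [[2., 3.], [3.]]          # fresh, unpadded lists
>>> out = np.full((len(lists), max_len), np.nan, dtype=np.float32)
>>> for i, sublist in enumerate(lists):
...     out[i, :len(sublist)] = sublist
...
>>> out
array([[  2.,   3.],
       [  3.,  nan]], dtype=float32)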

clearing elements of numpy array

Is there a simple way to clear all elements of a numpy array? I tried:
del arrayname
This removes the array completely. I am using this array inside a for loop that iterates thousands of times, so I prefer to keep the array but populate it with new elements every time.
I tried numpy.delete, but for my requirement I don't see the use of subarray specification.
*Edited*:
The array size is not going to be the same.
I allocate the space inside the loop, at the beginning, as follows. Please correct me if this is the wrong way to go about it:
arrname = arange(x*6).reshape(x,6)
I read a dataset and construct this array for each tuple in the dataset. All I know is the number of columns is going to be the same but not the number of rows. For example, the first time I might need an array of size (3,6), for the next tuple as (1,6) and the next time as (4,6) and so on. The way I populate the array is as follows:
arrname[:,0] = lstname1
arrname[:,1] = lstname2
...
In other words, the columns are filled from lists constructed from the tuples. So, before the next iteration begins, I want to clear the array's elements and make it ready, since I don't want remnants from the previous iteration mixing with the current contents.
I'm not sure what you mean by clear; the array will always have some values stored in it. But you can set those values to something, for example:
>>> A = numpy.array([[1, 2], [3, 4], [5, 6]], dtype=float)
>>> A
array([[ 1.,  2.],
       [ 3.,  4.],
       [ 5.,  6.]])
>>> A.fill(0)
>>> A
array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])
>>> A[:] = 1.
>>> A
array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])
Update
First, your question is very unclear. The more effort you put into writing a good question, the better answers you'll get. A good question should make it clear to us what you're trying to do and why. Example data is also very helpful, just a small amount, so we can see exactly what you're trying to do.
That being said, it seems like you should just create a new array for each iteration. Creating arrays is pretty fast, and it's not clear why you would want to reuse an array when its size and contents need to change. If you're trying to reuse it for performance reasons, you probably won't see any measurable difference; resizing arrays is not noticeably faster than creating a new one. You can create a new array by calling numpy.zeros((X, 6)).
Also in your question you say:
the columns are filled from lists constructed from the tuples
If your data is already housed in a list of tuples, you can use numpy.array to convert it to an array. You don't need to go to the trouble of creating an array and filling it. For example, if I wanted to get a (2, 3) array from a list of tuples, I would do:
data = [(0, 0, 1), (0, 0, 2)]
A = numpy.array(data)
# or if the data is stored like this
data = [(0, 0), (0, 0), (1, 2)]
A = numpy.array(data).T
Hope that helps.
With a wag of the finger for possible premature optimization, I will offer some thoughts:
You say you don't want any remnants left over from previous iterations. From your code it looks like you populate each of the new elements column by column, for each of the known number of columns, so "left over" values don't look like a problem. Consider:
Using arange and reshape serves no purpose; use np.empty((n, 6)). It is faster than ones or zeros by a hair.
You could alternatively construct your new array directly from the constituent lists.
See:
lstname1 = np.arange(3)
lstname2 = 22*np.arange(3)
np.vstack((lstname1, lstname2)).T
# returns
array([[ 0,  0],
       [ 1, 22],
       [ 2, 44]])
# or
np.hstack((lstname1[:, np.newaxis], lstname2[:, np.newaxis]))
array([[ 0,  0],
       [ 1, 22],
       [ 2, 44]])
Lastly, if you are really concerned about speed, you could allocate the largest expected size up front (if it is not known, check each requested size against the largest seen so far, and only when it is larger call np.empty((rows, cols)) to grow the buffer).
Then at each iteration you create a view of the larger matrix containing just the number of rows you want. This makes numpy reuse the same buffer space and avoids any allocation on most iterations. Notice:
In [36]: big = np.vstack((lstname1, lstname2)).T
In [37]: smaller = big[:2]
In [38]: smaller[:,1] = 33
In [39]: smaller
Out[39]:
array([[ 0, 33],
       [ 1, 33]])
In [40]: big
Out[40]:
array([[ 0, 33],
       [ 1, 33],
       [ 2, 44]])
Note: these are suggestions that fit your expanded, clarified question; they do not fit your earlier question about "clearing" the array. Even in the latter example you could simply call smaller.fill(0) to allay concerns, depending on whether you reliably reassign all elements of the array in each iteration.
If you want to keep the array allocated, and with the same size, you don't need to clear the elements. Simply keep track of where you are, and overwrite the values in the array. This is the most efficient way of doing it.
I would simply begin putting the new values into the array.
But if you insist on clearing out the array, try making a new one of the same size using zeros or empty.
>>> A = numpy.array([[1, 2], [3, 4], [5, 6]])
>>> A
array([[1, 2],
       [3, 4],
       [5, 6]])
>>> A = numpy.zeros(A.shape)
>>> A
array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])
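One caveat with numpy.zeros(A.shape): it defaults to float64, which is why the output above turned into floats. If you want a cleared array with the original dtype, numpy.zeros_like does that:
>>> numpy.zeros_like(numpy.array([[1, 2], [3, 4], [5, 6]]))
array([[0, 0],
       [0, 0],
       [0, 0]])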
