does numpy asarray() refer to original list - python

I have a very long list of lists and I am converting it to a numpy array using numpy.asarray(). Is it safe to delete the original list after getting this matrix, or will the newly created numpy array also be affected by this action?

I am pretty sure that the data is not shared and that you can safely remove the lists. Your original matrix is a nested structure of Python objects, with the numbers themselves also being Python objects, which can be located anywhere in memory. A Numpy array is also an object, but it is more or less a header that contains the dimensions and type of the data, with a pointer to a contiguous block of data where all the numbers are packed as closely as possible as 'raw numbers'. There is no way these two different layouts could share data, so the data has to be copied when you create the Numpy array. Example:
In [1]: m = [[1,2,3],[4,5,6],[7,8,9]]
In [2]: import numpy as np
In [3]: M = np.array(m)
In [4]: M[1,1] = 55
In [5]: M
Out[5]:
array([[ 1,  2,  3],
       [ 4, 55,  6],
       [ 7,  8,  9]])
In [6]: m
Out[6]: [[1, 2, 3], [4, 5, 6], [7, 8, 9]] # original is not modified!
Note that Numpy arrays can share data between each other, e.g. when you make a slice into an array. This is called a 'view', so if you modify data in the subset, it will also change in the original array:
In [18]: P = M[1:, 1:]
In [19]: P[1,1] = 666
In [20]: P
Out[20]:
array([[ 55,   6],
       [  8, 666]])
In [21]: M
Out[21]:
array([[  1,   2,   3],
       [  4,  55,   6],
       [  7,   8, 666]]) # original is also modified!
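If you are unsure whether two arrays share data, you can ask Numpy directly. The following is a minimal sketch using np.shares_memory and the .base attribute; the variable names view and copy are only illustrative:

```python
import numpy as np

M = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

view = M[1:, 1:]          # basic slicing returns a view onto M's buffer
copy = M[1:, 1:].copy()   # .copy() allocates a separate buffer

print(np.shares_memory(M, view))   # True
print(np.shares_memory(M, copy))   # False
print(view.base is M)              # True: the view keeps a reference to M

copy[0, 0] = -1    # only changes the copy
view[0, 0] = 55    # writes through to M[1, 1]
print(M[1, 1])     # 55
```

Because a view keeps a reference to its base array, deleting the name M would not free the underlying data while view is still alive.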

The data are copied over because the numpy array stores its own copy of the data, as described by Bas Swinckels. You can test this for yourself too. Although the trivially small list might make the point too, the ginormous data set below might bring the point home a little better ;)
import numpy as np
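# In Python 3, range() is lazy, so wrap it in list() to get a real list.
# Increase the size if you have the memory; a billion elements needs tens of GB.
list_data = list(range(10_000_000)) # note, this can take a while
# This will also take a while
# because it is copying the data in memory
array_data = np.asarray(list_data)
# deleting the list only frees the Python objects, not the array's buffer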
del list_data
# But you still have the data even after deleting the list
print(array_data[1000])

Yes, it is safe to delete it if your input data consists of a list. According to the documentation, no copy is performed ONLY if the input is already an ndarray (with matching dtype and order).
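A short sketch of the cases mentioned above, assuming a small nested list as a stand-in for the real data (the names nested, existing, and converted are only illustrative):

```python
import numpy as np

nested = [[1, 2, 3], [4, 5, 6]]
from_list = np.asarray(nested)   # list input: values are copied into a new buffer

nested[0][0] = 99                # changing the list afterwards ...
print(from_list[0, 0])           # ... leaves the array untouched: 1

del nested                       # safe: the array owns its own data
print(from_list.sum())           # 21

existing = np.arange(6)
same = np.asarray(existing)      # ndarray input with the same dtype: no copy
print(same is existing)          # True

converted = np.asarray(existing, dtype=np.float64)   # dtype differs: a copy is made
print(np.shares_memory(existing, converted))         # False
```

So the only situation where deleting or modifying the source matters is when the source was already an ndarray that asarray handed back unchanged.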

Related

Is there any way of getting multiple ranges of values in numpy array at once?

Let's say we have a simple 1D ndarray. That is:
import numpy as np
a = np.array([1,2,3,4,5,6,7,8,9,10])
I want to get the first 3 and the last 2 values, so that the output would be [ 1 2 3 9 10].
I have already solved this by slicing the two ranges separately and concatenating them as follows:
b= a[:2]
c= a[-2:]
a=np.concatenate([b,c])
However I would like to know if there is a more direct way to achieve this using slices, such as a[:2 and -2:] for instance. As an alternative I already tried this :
a = a[np.r_[:2, -2:]]
but it does not seem to work. It returns only the first 2 values, that is [1 2].
Thanks in advance!
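AFAIK, a slice of a numpy array can only describe one regular pattern (start, stop, step), so a single slice cannot return two separate ranges. The -2: inside np.r_ does not work because np.r_ does not know how big the array a is, so it cannot expand the open-ended slice into indices. You could do np.r_[:2, len(a)-2:len(a)], but this will still copy the data since you are indexing with another array.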
If you want to avoid copying data or doing any concatenation operation you could use np.lib.stride_tricks.as_strided:
ds = a.dtype.itemsize
np.lib.stride_tricks.as_strided(a, shape=(2,2), strides=(ds * 8, ds)).ravel()
Output:
array([ 1, 2, 9, 10])
But since you want the first 3 and last 2 values the stride for accessing the elements will not be equal. This is a bit trickier, but I suppose you could do:
np.lib.stride_tricks.as_strided(a, shape=(2,3), strides=(ds * 8, ds)).ravel()[:-1]
Output:
array([ 1, 2, 3, 9, 10])
However, this is a potentially dangerous operation because the last element is read from outside the allocated memory.
On reflection, I cannot find a way to do this operation without copying the data somehow. The ravel calls in the code snippets above are forced to make a copy of the data. If you can live with the shapes (2,2) or (2,3) it might work in some cases, but you should treat the strided view as read-only, and this should be enforced by passing the keyword writeable=False.
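If a copy is acceptable, a safer alternative to as_strided is to give np.r_ an explicit stop for the negative range. This is a sketch rather than part of the answer above; it relies on np.r_ expanding -2:0 into the indices -2 and -1:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# -2:0 has an explicit stop, so np.r_ can expand it to [-2, -1]
idx = np.r_[:3, -2:0]
print(idx.tolist())       # [0, 1, 2, -2, -1]

b = a[idx]
print(b.tolist())         # [1, 2, 3, 9, 10]

# The same selection with explicit positive indices
c = a[np.r_[0:3, len(a) - 2:len(a)]]
print(c.tolist())         # [1, 2, 3, 9, 10]

# Indexing with an index array always copies
print(np.shares_memory(a, b))   # False
```

Unlike the strided version, this never reads outside the array's buffer, at the cost of copying five elements.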
You could try to access the elements with a list of indices.
import numpy as np
a = np.array([1,2,3,4,5,6,7,8,9,10])
b = a[[0,1,2,8,9]] # b should now be array([ 1, 2, 3, 9, 10])
Obviously, if your array is long, you would not want to type out all the indices.
Instead, you could build the index list from ranges.
Something like this:
index_list = list(range(3)) + list(range(8, 10))
b = a[index_list] # b should now be array([ 1, 2, 3, 9, 10])
Therefore, as long as you know where your desired elements are, you can access them individually.

how to delete a row or column in numpy array without actually creating a new copy?

I want to delete a particular row or column without actually creating a new copy in python numpy.
Right now I'm doing arr = np.delete(arr, row_or_column_number, axis), but it returns a copy and I have to assign it back to itself every time.
I was wondering if a more ingenious approach could be used, where the change is made to the array itself instead of creating a new copy every time?
In [114]: x = np.arange(12).reshape(3,4)
In [115]: x.shape
Out[115]: (3, 4)
In [116]: x.ravel()
Out[116]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
Do you understand how arrays are stored? Basically there's a flat storage of the elements, much like this ravel, and shape and strides. If not, you need to spend some time reading a numpy tutorial.
Delete makes a new array:
In [117]: y = np.delete(x, 1, 0)
In [118]: y
Out[118]:
array([[ 0,  1,  2,  3],
       [ 8,  9, 10, 11]])
In [119]: y.shape
Out[119]: (2, 4)
In [120]: y.ravel()
Out[120]: array([ 0, 1, 2, 3, 8, 9, 10, 11])
This delete is the same as selecting 2 rows from x, x[[0,2],:].
Its data elements are different; it has to copy values from x. Whether you assign that back to x doesn't matter. Variable assignment is a trivial python operation. What matters is how the new array is created.
Now in this particular case it is possible to create a view. This is still a new array, but it shares memory with x. That's possible because I am selecting a regular pattern, not an arbitrary subset of the rows or columns.
In [121]: x[0::2,:]
Out[121]:
array([[ 0,  1,  2,  3],
       [ 8,  9, 10, 11]])
Again, if view doesn't make sense, you need to read more numpy basics. And don't skip the python basics either.
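The difference between the three selections above can be checked with np.shares_memory. A minimal sketch (the names deleted, fancy, and strided are only illustrative):

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

deleted = np.delete(x, 1, axis=0)   # new array, values copied from x
fancy = x[[0, 2], :]                # index list: also a copy
strided = x[0::2, :]                # regular slice: a view of x

# All three hold the same values
print(np.array_equal(deleted, fancy), np.array_equal(deleted, strided))  # True True

print(np.shares_memory(x, deleted))   # False
print(np.shares_memory(x, fancy))     # False
print(np.shares_memory(x, strided))   # True

strided[0, 0] = 100    # writes through to x
print(x[0, 0])         # 100
print(deleted[0, 0])   # 0, the copy is unaffected
```

The view avoids copying, but x itself still holds all three rows, so no memory is released by "deleting" a row this way.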
Unfortunately, you can't do this using numpy. Array scalars are immutable. See documentation.
Link to a related question: How to remove specific elements in a numpy array
Once a numpy array is created, its size is fixed. To delete (or add) a column or row, a new copy needs to be created.
(Even if numpy had an option to drop columns without reassignment, it's likely that another copy would still be created. Another library, Pandas, has an option called "inplace" to delete a column from an object without doing any reassignment, but its use is discouraged, and it doesn't literally prevent a copy from being created. For these reasons, it may be deprecated in the future.)

Getting rows corresponding to label, for many labels

I have a 2D array, where each row has a label that is stored in a separate array (not necessarily unique). For each label, I want to extract the rows from my 2D array that have this label. A basic working example of what I want would be this:
import numpy as np
data=np.array([[1,2],[3,5],[7,10], [20,32],[0,0]])
label=np.array([1,1,1,0,1])
#very simple approach
label_values=np.unique(label)
res=[]
for la in label_values:
    data_of_this_label_val=data[label==la]
    res+=[data_of_this_label_val]
print(res)
The result (res) can have any format, as long as it is easily accessible. In the above example, it would be
[array([[20, 32]]), array([[ 1,  2],
       [ 3,  5],
       [ 7, 10],
       [ 0,  0]])]
Note that I can easily associate each element in my list to one of the unique labels in label_values (that is, by index).
While this works, using a for loop can take quite a lot of time, especially if my label vector is large. Can this be sped up or coded more elegantly?
You can argsort the labels (which is what unique does under the hood I believe).
If your labels are small nonnegative integers as in the example, you can get it a bit cheaper, see https://stackoverflow.com/a/53002966/7207392.
>>> import numpy as np
>>>
>>> data=np.array([[1,2],[3,5],[7,10], [20,32],[0,0]])
>>> label=np.array([1,1,1,0,1])
>>>
>>> idx = label.argsort()
# use kind='mergesort' if you require a stable sort, i.e. one that
# preserves the order of equal labels
>>> ls = label[idx]
>>> split = 1 + np.where(ls[1:] != ls[:-1])[0]
>>> np.split(data[idx], split)
[array([[20, 32]]), array([[ 1,  2],
       [ 3,  5],
       [ 7, 10],
       [ 0,  0]])]
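The argsort-and-split steps above can be wrapped in a small helper that also keeps track of which label each group belongs to. This is only a sketch: the function name rows_by_label is made up for illustration, and it assumes a non-empty 1D label array with one entry per row of data.

```python
import numpy as np

def rows_by_label(data, label):
    """Group the rows of `data` by the matching entries of `label`.

    Hypothetical helper; assumes `label` is a non-empty 1D array
    with len(label) == len(data).
    """
    # A stable sort keeps rows with equal labels in their original order
    order = np.argsort(label, kind="stable")
    sorted_labels = label[order]
    # Positions where the sorted label changes mark the group boundaries
    boundaries = 1 + np.flatnonzero(sorted_labels[1:] != sorted_labels[:-1])
    groups = np.split(data[order], boundaries)
    # The first element of each group gives that group's label
    starts = np.concatenate(([0], boundaries))
    return dict(zip(sorted_labels[starts].tolist(), groups))

data = np.array([[1, 2], [3, 5], [7, 10], [20, 32], [0, 0]])
label = np.array([1, 1, 1, 0, 1])

grouped = rows_by_label(data, label)
print(grouped[0].tolist())   # [[20, 32]]
print(grouped[1].tolist())   # [[1, 2], [3, 5], [7, 10], [0, 0]]
```

Returning a dict keyed by label avoids having to line the list of groups up against np.unique(label) by index.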
Unfortunately, there isn't a built-in groupby function in numpy, though you could write alternatives. However, your problem could be solved more succinctly using pandas, if that's available to you:
import pandas as pd
res = pd.DataFrame(data).groupby(label).apply(lambda x: x.values).tolist()
# or, if performance is important, the following will be faster on large arrays,
# but less readable IMO:
res = [data[i] for i in pd.DataFrame(data).groupby(label).groups.values()]
[array([[20, 32]]), array([[ 1,  2],
       [ 3,  5],
       [ 7, 10],
       [ 0,  0]])]

Appending a new row to a numpy array

I am trying to append a new row to an existing numpy array in a loop. I have tried the methods involving append, concatenate and also vstack, but none of them give me the result I want.
I have tried the following:
for _ in col_change:
if (item + 2 < len(col_change)):
arr=[col_change[item], col_change[item + 1], col_change[item + 2]]
array=np.concatenate((array,arr),axis=0)
item+=1
I have also tried it in the most basic format and it still gives me an empty array.
array=np.array([])
newrow = [1, 2, 3]
newrow1 = [4, 5, 6]
np.concatenate((array,newrow), axis=0)
np.concatenate((array,newrow1), axis=0)
print(array)
I want the output to be [[1,2,3][4,5,6]...]
The correct way to build an array incrementally is to not start with an array:
alist = []
alist.append([1, 2, 3])
alist.append([4, 5, 6])
arr = np.array(alist)
This is essentially the same as
arr = np.array([ [1,2,3], [4,5,6] ])
the most common way of making a small (or large) sample array.
Even if you have good reason to use some version of concatenate (hstack, vstack, etc), it is better to collect the components in a list, and perform the concatenate once.
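Applied to the loop in the question, the list-first approach might look like the sketch below. The values in col_change are hypothetical, since the question does not show them; the loop bound mirrors the question's item + 2 < len(col_change) check.

```python
import numpy as np

col_change = [10, 20, 30, 40, 50]   # hypothetical sample values

# Collect each 3-element window in a plain Python list first
rows = []
for item in range(len(col_change) - 2):
    rows.append(col_change[item:item + 3])

# Build the array once, after the loop
arr = np.array(rows)
print(arr.shape)      # (3, 3)
print(arr.tolist())   # [[10, 20, 30], [20, 30, 40], [30, 40, 50]]
```

Appending to a list is cheap, whereas every call to concatenate inside the loop would allocate and fill a whole new array.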
If you want [[1,2,3],[4,5,6]] I could present you an alternative without append: np.arange and then reshape it:
>>> import numpy as np
>>> np.arange(1,7).reshape(2, 3)
array([[1, 2, 3],
       [4, 5, 6]])
Or create a big array and fill it manually (or in a loop):
>>> array = np.empty((2, 3), int)
>>> array[0] = [1,2,3]
>>> array[1] = [4,5,6]
>>> array
array([[1, 2, 3],
       [4, 5, 6]])
A note on your examples:
In the second one you forgot to save the result; make it array = np.concatenate((array,newrow1), axis=0) and it works (not exactly the way you want it, but the array is not empty anymore). The first example seems badly indented, and without knowing the variables and/or the problem it's hard to debug.
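To make the "not exactly like you want" part concrete, here is a sketch of what the corrected second example produces, followed by one possible way to get 2D rows by starting from an empty array with shape (0, 3). The name rows is only illustrative.

```python
import numpy as np

# Saving the result works, but 1D inputs are joined end to end
array = np.array([])
array = np.concatenate((array, [1, 2, 3]), axis=0)
array = np.concatenate((array, [4, 5, 6]), axis=0)
print(array.shape)    # (6,), a flat array rather than two rows

# Starting from a (0, 3) array lets vstack add one row at a time
rows = np.empty((0, 3), dtype=int)
for newrow in ([1, 2, 3], [4, 5, 6]):
    rows = np.vstack((rows, newrow))   # each call still allocates a new array
print(rows.tolist())  # [[1, 2, 3], [4, 5, 6]]
```

This gives the requested shape, but since every vstack copies the whole array, collecting rows in a list and converting once remains the cheaper option for long loops.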

Can I avoid using `asmatrix`?

Is there any way for me to create matrices directly and not have to use asmatrix? From what I can see, all of the typical matrix functions (ones, rand, etc) in Numpy return arrays, not matrices, which means (according to the documentation) that asmatrix will copy the data. Is there any way to avoid this?
According to the documentation:
Unlike matrix, asmatrix does not make a copy if the input is already a
matrix or an ndarray. Equivalent to matrix(data, copy=False).
So, asmatrix does not copy the data if it doesn't need to:
>>> import numpy as np
>>> a = np.arange(9).reshape((3,3))
>>> b = np.asmatrix(a)
>>> b.base is a
True
>>> a[0] = 3
>>> b
matrix([[3, 3, 3],
        [3, 4, 5],
        [6, 7, 8]])
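The contrast with np.matrix can be checked the same way. A minimal sketch, assuming the default copy=True of the matrix constructor (the names as_m, m, and ones_m are only illustrative):

```python
import numpy as np

a = np.arange(9).reshape(3, 3)

as_m = np.asmatrix(a)   # wraps a's buffer, no copy
m = np.matrix(a)        # matrix() copies by default

print(np.shares_memory(a, as_m))   # True
print(np.shares_memory(a, m))      # False

# Array-returning functions such as ones can be wrapped without copying
ones_m = np.asmatrix(np.ones((2, 2)))
print(type(ones_m).__name__, ones_m.shape)   # matrix (2, 2)
```

So wrapping the output of the usual array constructors in asmatrix costs only the small matrix header, not a second copy of the data.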
