How does numpy array typing interact with object?

I am currently trying to implement a datatype that stores floats in a numpy array. However, trying to fill an array with elements of this type that have various lengths breaks the code: one would have to assign a sequence to a single array element, which is not possible.
One can bypass this by using the data type object instead of float. Why is that? How could one resolve this problem using floats without creating a sequence?
Example code that does not work:
from numpy import *
foo = dtype(float32, [])
x = array([[2., 3.], [3.]], dtype=foo)
Example code that does work:
from numpy import *
foo = dtype(float32, [])
x = array([[2., 3.], [3., 2.]], dtype=foo)
Example code that does work, whose behavior I try to replicate for float:
from numpy import *
foo = dtype(object, [])
x = array([[2., 3.], [3.]], dtype=foo)

The object dtype in Numpy simply creates an array of pointers to Python objects. This means you lose the performance advantage you usually get from Numpy, but it's still sometimes useful to do this.
Your last example creates a one-dimensional Numpy array of length two, so that's two pointers to Python objects. Both of these objects happen to be lists, and Python lists have arbitrary, dynamic length.
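A quick way to see this structure (an illustrative session, not from the original answer):
>>> import numpy as np
>>> x = np.array([[2., 3.], [3.]], dtype=object)
>>> x.shape, type(x[0])
((2,), <class 'list'>)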

I don't know what you were trying to achieve with this, but note that
>>> np.dtype(np.float32, []) == np.float32
True
Arrays require the same number of elements for each row. So, if you feed a list of lists to numpy and all sublists have the same number of elements, it'll happily convert it to an array. This is why your second example works.
If the sublists are not all the same length, then each sublist is treated as a single object and you end up with a 1D array of objects. This is why your third example works. Your first example doesn't work because each sublist, being a sequence, cannot be cast to a single float.
In short, you can't create a 2D array of floats if your sublists have different lengths. At best, you can create a 1D object array whose elements are themselves 1D float arrays.
>>> x = np.array(list(map(np.array, [[2., 3.], [3.]])))
>>> x
array([array([ 2., 3.]), array([ 3.])], dtype=object)
>>> x[0]
array([ 2., 3.])
>>> x[0][1]
3.0
>>> # but you can't do this
>>> x[0,1]
Traceback (most recent call last):
File "<pyshell#18>", line 1, in <module>
x[0,1]
IndexError: too many indices for array
If you're bent on creating a float 2D array, you have to extend all your sublists to the same size with None, which will be converted to np.nan.
>>> lists = [[2., 3.], [3.]]
>>> max_len = max(map(len, lists))
>>> for i, sublist in enumerate(lists):
...     sublist = sublist + [None] * (max_len - len(sublist))
...     lists[i] = sublist
>>> np.array(lists, dtype=np.float32)
array([[ 2.,  3.],
       [ 3., nan]], dtype=float32)
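For illustration, the padding step can also be written with itertools.zip_longest, which pads while transposing (this variant is my addition, not part of the original answer):
>>> from itertools import zip_longest
>>> lists = [[2., 3.], [3.]]
>>> # transpose with padding, then transpose back to restore row order
>>> padded = list(zip(*zip_longest(*lists, fillvalue=None)))
>>> np.array(padded, dtype=np.float32)
array([[ 2.,  3.],
       [ 3., nan]], dtype=float32)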


Calculate the mean over a mixed data structure

I have a list of lists that looks something like:
data = [
    [1., np.array([2., 3., 4.]), ...],
    [5., np.array([6., 7., 8.]), ...],
    ...
]
where each of the internal lists is the same length and contains the same data type/shape at each entry. I would like to calculate the mean over corresponding entries and return something of the same structure as the internal lists. For example, in the above case (assuming only two entries) I want the result to be:
[3., np.array([4., 5., 6.]), ...]
What is the best way to do this with Python?
data is a list, so a list comprehension seems like a natural option. Even if it were a numpy array, given that it's a jagged array, it wouldn't benefit from being wrapped in an ndarray anyway, so a list comp would still be the best option, in my opinion.
Anyway, use zip() to "transpose" data and call np.mean() in a loop to find the mean along the first axis.
[np.mean(x, axis=0) for x in zip(*data)]
# [3.0, array([4., 5., 6.]), array([[2., 2.], [2., 2.]])]
If you have a list exactly like the one shown in the example, you can do it with the following code.
First we declare some variables to store our results:
number_sum = 0
list_sum = np.array([0., 0., 0.])
It is important that you initialize list_sum with one zero per array element, and with float zeros so the sums are not truncated to integers. That is, if the arrays in data contain 5 elements each, it should be: list_sum = np.array([0., 0., 0., 0., 0.]).
The next step is to perform the sum of all the elements in data. First we add the scalar values and then we add each element of the array, as follows:
for number, nparray in data:
    number_sum += number
    for index, item in enumerate(nparray):
        list_sum[index] += item
Since we know how the variable data is structured (each entry is made up of a float value and an np.array), we can do the addition that way. Be careful with the computational cost, though: with longer arrays it grows quickly, since two for loops are nested.
Finally, you can check that if you divide the sum of the elements by the length of data you get the desired value:
print(number_sum/len(data))
print(list_sum/len(data))
Now you just have to add those two new values to a new list. I hope it helps, greetings and good luck!
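For completeness, the final assembly could look like this (a sketch reusing the names from this answer):
result = [number_sum / len(data), list_sum / len(data)]
print(result)   # [3.0, array([4., 5., 6.])] for a two-row data list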
The following works:
import numpy as np
data = [
    [1., np.array([2., 3., 4.]), np.array([[1., 1.], [1., 1.]])],
    [5., np.array([6., 7., 8.]), np.array([[3., 3.], [3., 3.]])],
]
number_of_samples = len(data)
number_of_elements = len(data[0])
means = []
for ielement in range(number_of_elements):
    mean_list = []
    for isample in range(number_of_samples):
        mean_list.append(data[isample][ielement])
    mean_list = np.stack(mean_list)
    mean = mean_list.mean(axis=0)
    means.append(mean)
print(means)
but it is a bit ugly, nests two for loops, and does not seem very pythonic. Any improvements over this are welcome.
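One tidier possibility (a sketch assuming the same data as above) folds both loops into a comprehension over the transposed data, much like the zip() answer above:
means = [np.stack(column).mean(axis=0) for column in zip(*data)]
print(means)   # [3.0, array([4., 5., 6.]), array([[2., 2.], [2., 2.]])]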

Deleting multiple elements at once from a numpy 2d array

Is there a way to delete from a numpy 2d array when I have the indexes? For example:
a = np.random.random((4,5))
idxs = [(0,1), (1,3), (2, 1), (3,4)]
I want to remove the indexes specified above. I tried:
np.delete(a, idxs)
but it just removes the top row.
To give an example, for the following input:
[
[0.15393912, 0.08129568, 0.34958515, 0.21266128, 0.92372852],
[0.42450441, 0.1027468 , 0.13050591, 0.60279229, 0.41168151],
[0.06330729, 0.60704682, 0.5340644 , 0.47580567, 0.42528617],
[0.27122323, 0.42713967, 0.94541073, 0.21462462, 0.07293321]
]
and with the indexes as mentioned above, I want the result to be:
[
[0.15393912, 0.34958515, 0.21266128, 0.92372852],
[0.42450441, 0.1027468 , 0.13050591, 0.41168151],
[0.06330729, 0.5340644 , 0.47580567, 0.42528617],
[0.27122323, 0.42713967, 0.94541073, 0.21462462]
]
Your indices need to refer to the flattened array; otherwise np.delete only works for removing a whole row or column.
Here is how you can convert the (row, column) indices to flat indices and use them:
arr = np.array([
[0.15393912, 0.08129568, 0.34958515, 0.21266128, 0.92372852],
[0.42450441, 0.1027468 , 0.13050591, 0.60279229, 0.41168151],
[0.06330729, 0.60704682, 0.5340644 , 0.47580567, 0.42528617],
[0.27122323, 0.42713967, 0.94541073, 0.21462462, 0.07293321]
])
idxs = [(0,1), (1,3), (2, 1), (3,4)]
idxs = [i*arr.shape[1]+j for i, j in idxs]
np.delete(arr, idxs).reshape(4,4)
For the reshape to work, you must remove items so that every row is left with the same number of elements after deletion.
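The same index conversion can also be written with np.ravel_multi_index (an alternative spelling, not from the original answer), reusing arr from above:
rows, cols = zip(*[(0, 1), (1, 3), (2, 1), (3, 4)])
flat_idxs = np.ravel_multi_index((rows, cols), arr.shape)   # same flat indices as the list comprehension
np.delete(arr, flat_idxs).reshape(4, 4)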
Numpy doesn't know that you are removing exactly one element per row when you give it arbitrary indices like that. Since you do know that, I would suggest using a mask to shrink the array. Masking has the same problem: it doesn't assume anything about the shape of the result (because it can't in general), and returns a raveled array. You can reinstate the shape you want quite easily though. In fact, I would suggest removing the first element of each index entirely, since you have one per row:
def remove_indices(a, idx):
    # idx holds exactly one column index per row of a
    if len(idx) != len(a): raise ValueError('Wrong number of indices')
    mask = np.ones(a.shape, dtype=np.bool_)
    mask[np.arange(len(idx)), idx] = False
    return a[mask].reshape(a.shape[0], a.shape[1] - 1)
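Usage with the arrays from the question might look like this (my illustration):
a = np.random.random((4, 5))
idxs = [(0, 1), (1, 3), (2, 1), (3, 4)]
cols = [j for _, j in idxs]        # drop the row part; row i uses cols[i]
remove_indices(a, cols).shape      # (4, 4)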
Here is a method using np.where
import numpy as np
import operator as op
a = np.arange(20.0).reshape(4,5)
idxs = [(0,1), (1,3), (2, 1), (3,4)]
m,n = a.shape
# extract column indices
# there are simpler ways but this is fast
columns = np.fromiter(map(op.itemgetter(1),idxs),int,m)
# build decimated array
result = np.where(columns[...,None]>np.arange(n-1),a[...,:-1],a[...,1:])
result
# array([[ 0., 2., 3., 4.],
# [ 5., 6., 7., 9.],
# [10., 12., 13., 14.],
# [15., 16., 17., 18.]])
As the documentation says:
"Return a new array with sub-arrays along an axis deleted."
np.delete deletes a row or a column based on the value of the parameter axis.
Secondly, np.delete expects an int or an array of ints as its index parameter, not a list of tuples.
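A minimal illustration of those two behaviors (my example, not from the original answer):
a = np.arange(20.0).reshape(4, 5)
np.delete(a, 1, axis=0)    # axis given: removes the whole second row
np.delete(a, [1, 3])       # no axis: flat int indices, result is a 1D array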
You need to specify what your exact requirement is.
As @Divakar suggested, look at other answers on Stack Overflow about deleting individual items in a numpy array.

Adding arrays which may contain 'None'-entries

I have a question regarding the addition of numpy arrays.
Let's assume I have defined a function
def foo(a,b):
    return a+b
that takes two arrays of the same shape and simply returns their sum.
Now, I have to deal with the cases that some of the entries may be None.
I would like those entries to be treated as if they were float(0), such that
[1.0,None,2.0] + [1.0,2.0,2.0]
would add up to
[2.0,2.0,4.0]
Can you provide me with an already-implemented solution?
TIA
I suggest numpy.nan_to_num:
>>> np.nan_to_num(np.array([1.0,None,2.0], dtype=float))
array([ 1., 0., 2.])
Then,
>>> def foo(a,b):
...     return np.nan_to_num(a) + np.nan_to_num(b)
...
>>> foo(np.array([1.0,None,2.0], dtype=float), np.array([1.0,2.0,2.0], dtype=float))
array([ 2., 2., 4.])
Usually, the answer to this is to use an array of floats, rather than an array of arbitrary objects, and then use np.nan instead of None. NaN has well-defined semantics for arithmetic. (Also, using an array of floats instead of objects will make your code significantly more time and space efficient.)
Notice that you don't have to manually convert None to np.nan if you build the array with an explicit dtype of float or np.float64. Both of these are equivalent:
>>> a = np.array([1.0,np.nan,2.0])
>>> a = np.array([1.0,None,2.0],dtype=float)
Which means that if, for some reason, you really needed arrays of arbitrary objects with actual None in them, you could do that, and then convert it to an array of floats on the fly to get the benefits of NaN:
>>> a.astype(float) + b.astype(float)
At any rate, in this case, just using NaN isn't sufficient:
>>> a = np.array([1.0,np.nan,2.0])
>>> b = np.array([1.0,2.0,2.0])
>>> a + b
array([ 2., nan, 4.])
That's because the semantics of NaN are that the result of any operation with NaN returns NaN. But you want to treat it as 0.
It does make the problem easy to solve, though. The simplest way is the function nan_to_num, which replaces NaN with 0 by default:
>>> np.nan_to_num(a)
array([ 1., 0., 2.])
>>> np.nan_to_num(a) + np.nan_to_num(b)
array([ 2., 2., 4.])
You can use column_stack to concatenate both arrays along the second axis, then use np.nansum() to sum the items over the second axis.
In [15]: a = np.array([1.0,None,2.0], dtype=float)
# Using dtype here is necessary to convert None to np.nan
In [16]: b = np.array([1.0,2.0,2.0])
In [17]: np.nansum(np.column_stack((a, b)), 1)
Out[17]: array([2., 2., 4.])
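Equivalently, you can zero out the NaNs explicitly with np.where before adding (an alternative spelling, my addition):
In [18]: np.where(np.isnan(a), 0.0, a) + np.where(np.isnan(b), 0.0, b)
Out[18]: array([2., 2., 4.])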

Efficient way to drop a column from a Numpy array?

If I have a very large numpy array with one useless column, how could I drop it without creating a copy of the original array?
np.delete(my_np_array, 0, 1)
The above code will return a copy of the array without the zero-th column. But instead I would like to simply delete that column from my_np_array since I don't need it. For very large datasets, the memory management becomes important and copying may not be an option.
If memory is the main concern, what you can do is move columns around within your array so that the unneeded column ends up at the very end, then use ndarray.resize, which modifies the array in-place, to shrink it down and discard the outer column.
You cannot simply remove the first column of an array in-place using the provided API, and I suspect it is because of the memory layout of an ndarray that maps multidimensional indexing to unidimensional byte-oriented addressing within blocks of contiguous memory.
The following example copies the last column into the first and then shrinks the array in-place, purging the memory of one column at the cost of changing your column order. Note that ndarray.resize truncates the flat buffer in memory order rather than column-wise, so each row has to be compacted before resizing:
D1, D2 = A.shape
A[:, 0] = A[:, D2-1]                  # keep the last column's data by writing it over column 0
flat = A.ravel()                      # flat view of the buffer; A must be C-contiguous
for i in range(1, D1):                # compact the rows so the stale slots collect at the end
    flat[i*(D2-1):(i+1)*(D2-1)] = A[i, :D2-1].copy()
del flat                              # drop the extra view before resizing
A.resize((D1, D2-1), refcheck=False)
A.shape
# => would be (5, 4) if the shape was initially (5, 5) for example
If you use slicing, numpy won't make a copy; in other words:
a = numpy.array([1, 2, 3, 4, 5])
b = a[1:] # view elements from second to last, NOT making a copy
b[0] = 12 # Change first element of `b`, i.e. second of `a`
print(a)
will print [ 1 12  3  4  5]
If you need to delete an element in the middle, however, a single slicing operation won't work.
Numpy arrays have a fixed size, so they can't be resized without creating an intermediate copy.
How to remove specific elements in a numpy array
Creating a view with slicing and making a copy of that is probably the fastest you can do.
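For example (a sketch assuming, as in the question, that the unwanted column is the first one):
trimmed = my_np_array[:, 1:].copy()   # view of everything but column 0, then one copy
del my_np_array                       # drop the last reference so the big buffer can be freed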
In [804]: a = np.ones((2,2))
In [805]: a
Out[805]:
array([[ 1., 1.],
[ 1., 1.]])
In [806]: np.resize(a,(3,2))
Out[806]:
array([[ 1., 1.],
[ 1., 1.],
[ 1., 1.]])
In [807]: a   # <- a would now be resized if np.resize worked in-place
Out[807]:
array([[ 1., 1.],
[ 1., 1.]])

Numpy, a 2 rows 1 column file, loadtxt() returns 1 row 2 columns

2.765334406984874427e+00
3.309563282821381680e+00
The file looks like above: 2 rows, 1 col
numpy.loadtxt() returns
[ 2.76533441 3.30956328]
Please don't tell me to use array.transpose() in this case; I need a real solution. Thank you in advance!!
You can always use the reshape command. A single-column text file loads as a 1D array, which numpy displays like a row vector.
>>> a
array([ 2.76533441, 3.30956328])
>>> a[:,None]
array([[ 2.76533441],
[ 3.30956328]])
>>> b=np.arange(5)[:,None]
>>> b
array([[0],
[1],
[2],
[3],
[4]])
>>> np.savetxt('something.npz',b)
>>> np.loadtxt('something.npz')
array([ 0., 1., 2., 3., 4.])
>>> np.loadtxt('something.npz').reshape(-1,1) #Another way of doing it
array([[ 0.],
[ 1.],
[ 2.],
[ 3.],
[ 4.]])
You can check this using the number of dimensions.
data = np.loadtxt('data.npz')
if data.ndim == 1: data = data[:, None]
Or
np.loadtxt('something.npz', ndmin=2)  # Always gives at least a 2D array.
Although it's worth pointing out that if you always have a single column of data, numpy will always load it as a 1D array. This is more a feature of numpy arrays than a bug, I believe.
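Another spelling of the same fix (my illustration): np.atleast_2d turns a 1D result into a (1, N) row, so transposing yields the desired column.
>>> np.atleast_2d(np.loadtxt('something.npz')).T
array([[ 0.],
       [ 1.],
       [ 2.],
       [ 3.],
       [ 4.]])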
If you like, you can use matrix to read from a string. Suppose test.txt contains the data. Here's a function for your needs:
import numpy as np
def my_loadtxt(filename):
    return np.array(np.matrix(open(filename).read().strip().replace('\n', ';')))
a = my_loadtxt('test.txt')
print(a)
It gives column vectors if the input is a column vector. For the row vectors, it gives row vectors.
You might want to use the csv module:
import csv
import numpy as np
reader = csv.reader( open('file.txt') )
l = list(reader)
a = np.array(l, dtype=float)  # cast the strings produced by csv to floats
a.shape
>>> (2,1)
This way, you will get the correct array dimensions irrespective of the number of rows / columns present in the file.
I've written a wrapper for loadtxt to do this. It is similar to the answer from @petrichor, but I think matrix can't have a string data format (probably understandably), so that method doesn't seem to work if you're loading strings (such as column headings).
def my_loadtxt(filename, skiprows=0, usecols=None, dtype=None):
    d = np.loadtxt(filename, skiprows=skiprows, usecols=usecols, dtype=dtype, unpack=True)
    if len(d.shape) == 0:
        d = d.reshape((1, 1))
    elif len(d.shape) == 1:
        d = d.reshape((d.shape[0], 1))
    return d
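Hypothetical usage (the file name is an assumption):
d = my_loadtxt('data.txt')
d.shape   # always at least two dimensions, e.g. (2, 1) for the file in the question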
