I have a set of data that I would like to treat with numpy. The data can be looked at as a set of points in space, each with an additional property variable that I would like to handle as an object. Depending on the data set, the position vectors may be of length 1, 2, or 3, but the length is the same for all points in a given set. The property object is a custom class, and any two points may share the same property object.
So consider this data as a random example (C and H represent objects that contain atomic properties for carbon or hydrogen ... or just some random objects). The data will not be read in from a file, but created by an algorithm. Here the two C objects may be the same or they may differ (isotopes, for example).
Example 3D data set (just abstract representation)
C 1 2 3
C 3 4 5
H 1 1 4
I would like to have a numpy array that contains all of the atomic positions, so that I can perform numpy operations like vector manipulation, e.g. a translation function def translate(data, vec): return data + vec. I would also like to handle the property objects in parallel. One option would be to keep two separate arrays, but if I delete an element of one, I would have to remember to explicitly delete the corresponding property value as well. This could get difficult to handle.
I considered using numpy.recarray
x = np.array([(1.0,2,3, "C"), (3.0,2,3, "H")], dtype=[('x', "float64"), ('y', "float64"), ('z', "float64"), ('type', object)])
But it seems the shape of this array is (2,), which means that each record is handled independently. Also, I cannot work out how to get vector manipulation to work with this type:
def translate(data,vec):return data + vec
translate(x,np.array([1,2,3]))
...
TypeError: unsupported operand type(s) for +: 'numpy.ndarray' and 'numpy.ndarray'
Is numpy.recarray what I should be using? Is there a simpler way to handle this, such as a separate numerical matrix of points with a parallel object array, linked so that removing an element (np.delete) removes both? I also briefly considered writing an array class that extends ndarray, but I feel this may be unnecessary and potentially disastrous.
Any thoughts or suggestions would be very helpful.
The field of a recarray can be an ndarray if you pass the tuple (name, type, shape) as the dtype of the field:
In [9]:
import numpy as np
x = np.array([((1.0,2,3), "C"), ((3.0,2,3), "H")], dtype=[('xyz', "float64", (3,)), ('type', object)])
In [11]:
np.delete(x, 0)
Out[11]:
array([([3.0, 2.0, 3.0], 'H')],
dtype=[('xyz', '<f8', (3,)), ('type', 'O')])
In [12]:
x["xyz"]
Out[12]:
array([[ 1., 2., 3.],
[ 3., 2., 3.]])
In [14]:
x["xyz"] + (10, 20, 30)
Out[14]:
array([[ 11., 22., 33.],
[ 13., 22., 33.]])
For your translate function:
def translate(data, vec):
    tmp = data.copy()
    tmp["xyz"] += vec
    return tmp
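Because positions and property objects live in one structured array, deletion keeps them in sync automatically. A quick illustration (a hypothetical session, using the x defined above):

y = translate(x, np.array([1, 2, 3]))   # shifts the 'xyz' field, leaves 'type' untouched
y = np.delete(y, 0)                     # removes a point and its property object together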
If you want more flexible functions, you may consider using pandas.DataFrame.
If you are dealing with collections of atoms, you might consider using the Atoms class from the Atomic Simulation Environment (ASE). It stores atom types and positions, and has list-like methods to manipulate them.
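For example, the data above might look like this in ASE (a minimal sketch, assuming ASE is installed; check the ASE docs for the exact constructor signature):

from ase import Atoms

# two carbons and a hydrogen at the positions from the question
atoms = Atoms('CCH', positions=[[1, 2, 3], [3, 4, 5], [1, 1, 4]])
atoms.translate([1, 2, 3])   # shift every atom
del atoms[0]                 # removes the position and its type together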
One quick and dirty way would be to set the last (or indeed any) column to be a numerical lookup to a labels dictionary:
>>> import numpy
>>> labels = ['H', 'C', 'O']
>>> labels_refs = dict(zip(labels, numpy.arange(len(labels), dtype='float64')))
>>> reverse_labels_refs = dict(zip(numpy.arange(len(labels), dtype='float64'), labels))
>>> x = numpy.array([
... [1.0,2,3, labels_refs['C']],
... [3.0,2,3, labels_refs['H']],
... [2.0,2,3, labels_refs['C']]])
>>> x
array([[ 1., 2., 3., 1.],
[ 3., 2., 3., 0.],
[ 2., 2., 3., 1.]])
>>> extract_refs = numpy.vectorize(
... lambda label_ref: reverse_labels_refs[label_ref])
>>> labels = extract_refs(x[:, -1]) # Turn the last column back into labels
>>> labels
array(['C', 'H', 'C'],
dtype='|S8')
You can also lookup rows by their labels (as an example):
>>> x[numpy.where(x[:,-1] == labels_refs['C']), :-1]
array([[[ 1., 2., 3.],
[ 2., 2., 3.]]])
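Since the positions and the label references live in one array, deleting a row keeps both in sync (a small usage sketch, not from the original answer):

>>> x2 = numpy.delete(x, 1, axis=0)  # drops the H row and its label reference together
>>> x2
array([[ 1., 2., 3., 1.],
       [ 2., 2., 3., 1.]])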
Related
numpy.vectorize takes a function f:a->b and turns it into g:a[]->b[].
This works fine when a and b are scalars, but I can't think of a reason why it wouldn't work with b as an ndarray or list, i.e. f:a->b[] and g:a[]->b[][]
For example:
import numpy as np
def f(x):
    return x * np.array([1,1,1,1,1], dtype=np.float32)
g = np.vectorize(f, otypes=[np.ndarray])
a = np.arange(4)
print(g(a))
This yields:
array([[ 0. 0. 0. 0. 0.],
[ 1. 1. 1. 1. 1.],
[ 2. 2. 2. 2. 2.],
[ 3. 3. 3. 3. 3.]], dtype=object)
Ok, so that gives the right values, but the wrong dtype. And even worse:
g(a).shape
yields:
(4,)
So this array is pretty much useless. I know I can convert it by doing:
np.array(list(map(list, g(a))), dtype=np.float32)
to give me what I want:
array([[ 0., 0., 0., 0., 0.],
[ 1., 1., 1., 1., 1.],
[ 2., 2., 2., 2., 2.],
[ 3., 3., 3., 3., 3.]], dtype=float32)
but that is neither efficient nor pythonic. Can any of you guys find a cleaner way to do this?
np.vectorize is just a convenience function. It doesn't actually make code run any faster. If it isn't convenient to use np.vectorize, simply write your own function that works as you wish.
The purpose of np.vectorize is to transform functions which are not numpy-aware (e.g. take floats as input and return floats as output) into functions that can operate on (and return) numpy arrays.
Your function f is already numpy-aware -- it uses a numpy array in its definition and returns a numpy array. So np.vectorize is not a good fit for your use case.
The solution therefore is just to roll your own function f that works the way you desire.
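For instance, a minimal numpy-aware rewrite might broadcast over any input shape (a sketch; f_arrays is a name invented here for illustration):

import numpy as np

def f_arrays(x):
    # append a trailing axis so each element of x multiplies
    # the length-5 template via broadcasting
    x = np.asarray(x, dtype=np.float32)
    return x[..., np.newaxis] * np.ones(5, dtype=np.float32)

f_arrays(np.arange(4)).shape   # (4, 5), dtype float32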
A new parameter, signature, added in NumPy 1.12.0, does exactly what you want.
def f(x):
    return x * np.array([1,1,1,1,1], dtype=np.float32)
g = np.vectorize(f, signature='()->(n)')
Then g(np.arange(4)).shape will give (4L, 5L).
Here the signature of f is specified: the (n) is the shape of the return value, and the () is the shape of the parameter, which is scalar. The parameters can be arrays too, as shown below. For more complex signatures, see the Generalized Universal Function API.
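The input side can declare a core dimension as well. For example (an illustrative sketch; row_sum is just a made-up name):

import numpy as np

# '(n)->()': the function consumes a 1-D core of length n and returns a scalar,
# so vectorize loops over all the leading axes
row_sum = np.vectorize(lambda v: v.sum(), signature='(n)->()')
row_sum(np.arange(12).reshape(3, 4))   # array([ 6, 22, 38])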
import numpy as np

def f(x):
    return x * np.array([1,1,1,1,1], dtype=np.float32)

g = np.vectorize(f, otypes=[np.ndarray])
a = np.arange(4)
b = g(a)
b = np.array(b.tolist())
print(b)  # b.shape = (4, 5)
c = np.ones((2,3,4))
d = g(c)
d = np.array(d.tolist())
print(d)  # d.shape = (2, 3, 4, 5)
This should fix the problem, and it will work regardless of the size of your input. map only works for one-dimensional inputs. Using .tolist() and creating a new ndarray solves the problem more completely and, I believe, more nicely. Hope this helps.
You want to vectorize the function
import numpy as np

def f(x):
    return x * np.array([1,1,1,1,1], dtype=np.float32)
Assuming that you want a single np.float32 array as the result, you have to specify this as the otype. In your question, however, you specified otypes=[np.ndarray], which means you want every element to be an np.ndarray. Thus, you correctly get a result of dtype=object.
The correct call would be
np.vectorize(f, signature='()->(n)', otypes=[np.float32])
For such a simple function it is, however, better to leverage numpy's ufuncs; np.vectorize just loops over the function. So in your case, just rewrite your function as
def f(x):
    return np.multiply.outer(x, np.array([1,1,1,1,1], dtype=np.float32))
This is faster and produces less obscure errors (note, however, that the result's dtype will depend on x: if you pass a complex or quad-precision number, the result will be complex or quad precision as well).
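A quick sanity check of the outer-product shape (illustrative, not from the original answer):

np.multiply.outer(np.arange(4), np.ones(5, dtype=np.float32)).shape   # (4, 5)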
I've written a function that seems to fit your need:
import numpy as np

def amap(func, *args):
    '''array version of the built-in map

    amap(function, sequence[, sequence, ...]) -> array

    Examples
    --------
    >>> amap(lambda x: x**2, 1)
    array(1)
    >>> amap(lambda x: x**2, [1, 2])
    array([1, 4])
    >>> amap(lambda x,y: y**2 + x**2, 1, [1, 2])
    array([2, 5])
    >>> amap(lambda x: (x, x), 1)
    array([1, 1])
    >>> amap(lambda x,y: [x**2, y**2], [1,2], [3,4])
    array([[1, 9], [4, 16]])
    '''
    # broadcast a dummy leading argument so scalars and arrays align
    args = np.broadcast(None, *args)
    res = np.array([func(*arg[1:]) for arg in args])
    shape = args.shape + res.shape[1:]
    return res.reshape(shape)
Let's try

def f(x):
    return x * np.array([1,1,1,1,1], dtype=np.float32)

amap(f, np.arange(4))
Outputs
array([[ 0., 0., 0., 0., 0.],
[ 1., 1., 1., 1., 1.],
[ 2., 2., 2., 2., 2.],
[ 3., 3., 3., 3., 3.]], dtype=float32)
You may also wrap it with lambda or partial for convenience
g = lambda x:amap(f, x)
g(np.arange(4))
Note that the docstring of vectorize says:
The vectorize function is provided primarily for convenience, not for
performance. The implementation is essentially a for loop.
Thus we would expect amap to have performance similar to vectorize. I didn't check it; any performance tests are welcome.
If performance is really important, you should consider something else, e.g. direct array calculation with reshape and broadcasting to avoid looping in pure Python (both vectorize and amap are the latter case). A sketch of the direct approach is below.
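For this particular f, the direct broadcasting calculation might look like the following (an illustrative sketch, not from the original answer):

a = np.arange(4)
template = np.array([1, 1, 1, 1, 1], dtype=np.float32)
result = a[:, None] * template   # shape (4, 5); no Python-level loop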
The best way to solve this would be to use a 2-D NumPy array (in this case a column array) as input to the original function, which will then generate the 2-D output with the results I believe you were expecting.
Here is what it might look like in code:
import numpy as np
def f(x):
    return x * np.array([1, 1, 1, 1, 1], dtype=np.float32)
a = np.arange(4).reshape((4, 1))
b = f(a)
# b is a 2-D array with shape (4, 5)
print(b)
This is a much simpler and less error-prone way to complete the operation. Rather than trying to transform the function with numpy.vectorize, this method relies on NumPy's natural ability to broadcast arrays. The trick is to make sure the shapes are broadcast-compatible: each pair of trailing dimensions must be equal, or one of them must be 1, as with the (4, 1) input against the length-5 constant here.
I have a record array with a 2×2 fixed-size item and 10 rows; thus the column is 10×2×2. I would like to assign a constant to the whole column. A NumPy array broadcasts the scalar value correctly, but this does not work in h5py.
import numpy as np
import h5py
dt=np.dtype([('a',('f4',(2,2)))])
# h5py array
h5a=h5py.File('/tmp/t1.h5','w')['/'].require_dataset('test',dtype=dt,shape=(10,))
# numpy for comparison
npa=np.zeros((10,),dtype=dt)
h5a['a']=np.nan
# ValueError: changing the dtype of a 0d array is only supported if the itemsize is unchanged
npa['a']=np.nan
# numpy: broadcasts, OK
In fact, I can't find a way to assign the column without broadcasting:
h5a['a']=np.full((10,2,2),np.nan)
# ValueError: When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array
Not even a one-element row:
h5a['a',0]=np.full((2,2),np.nan)
# ValueError: When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array
What is the problem here?
With an h5py file f and the dt dtype from the question:
In [69]: d = f.create_dataset('test', dtype=dt, shape=(3,))
We can make a like-sized structured array and fill it:
In [90]: x=np.ones(3,dt)
In [91]: x[:]=2
In [92]: x
Out[92]:
array([([[2., 2.], [2., 2.]],), ([[2., 2.], [2., 2.]],),
([[2., 2.], [2., 2.]],)], dtype=[('a', '<f4', (2, 2))])
and assign it to the dataset:
In [93]: d[:]=x
In [94]: d
Out[94]: <HDF5 dataset "test": shape (3,), type "|V16">
In [95]: d[:]
Out[95]:
array([([[2., 2.], [2., 2.]],), ([[2., 2.], [2., 2.]],),
([[2., 2.], [2., 2.]],)], dtype=[('a', '<f4', (2, 2))])
We can also make a single-element array with the correct dtype and assign that:
In [116]: x=np.array((np.arange(4).reshape(2,2),),dt)
In [117]: x
Out[117]: array(([[0., 1.], [2., 3.]],), dtype=[('a', '<f4', (2, 2))])
In [118]: d[0]=x
With h5py we can index with record and field as:
In [119]: d[0,'a']
Out[119]:
array([[0., 1.],
[2., 3.]], dtype=float32)
Whereas an ndarray requires a double index, as in x[0]['a'].
h5py tries to imitate ndarray indexing, but it is not exactly the same. We just have to accept that.
edit
The [118] assignment can also be
In [207]: d[1,'a']=x
The dt here has just one field, but I think this should work with multiple fields. The key is that the assigned value has to be a structured array that matches the field specification of d.
I just noticed in the docs that they are trying to move away from the d[1,'a'] indexing, using d[1]['a'] instead. But for assignment that doesn't seem to work: no error, just no action. I think d[1] or d['a'] returns a copy, the equivalent of advanced indexing for arrays. For a structured ndarray those would be views.
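Given that, a practical way to assign a constant to the whole column is read-modify-write on the numpy side (a sketch based on the session above):

tmp = d[:]           # read the dataset into a structured ndarray
tmp['a'] = np.nan    # numpy broadcasting works on the ndarray copy
d[:] = tmp           # write the whole record array back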
I'm trying to use some sklearn estimators for classification on the coefficients of a fast Fourier transform (technically a discrete Fourier transform). I obtain a numpy array X_c as the output of np.fft.fft(X), and I want to transform it into a real numpy array X_r, with each (complex) column of the original X_c turned into two (real/float) columns of X_r, i.e. the shape goes from (r, c) to (r, 2c). So I use .view(np.float64), and it works at first.
The problem is that if I first keep only some coefficients of the original complex array with X_c2 = X_c[:, range(3)] and then do the same thing as before, instead of the number of columns being doubled, the number of rows is doubled (the imaginary part of each element is put in a new row below the original).
I really don't understand why this happens.
To make myself clearer, here is a toy example:
import numpy as np
# I create a complex array
X_c = np.arange(8, dtype = np.complex128).reshape(2, 4)
print(X_c.shape) # -> (2, 4)
# I use .view to transform it into something real and it works
# the way I want it.
X_r = X_c.view(np.float64)
print(X_r.shape) # -> (2, 8)
# Now I subset the array.
indices_coef = range(3)
X_c2 = X_c[:, indices_coef]
print(X_c2.shape) # -> (2, 3)
X_r2 = X_c2.view(np.float64)
# In the next line I obtain (4, 3), when I was expecting (2, 6)...
print(X_r2.shape) # -> (4, 3)
Does anyone see a reason for this difference of behavior?
I get a warning:
In [5]: X_c2 = X_c[:,range(3)]
In [6]: X_c2
Out[6]:
array([[ 0.+0.j, 1.+0.j, 2.+0.j],
[ 4.+0.j, 5.+0.j, 6.+0.j]])
In [7]: X_c2.view(np.float64)
/usr/local/bin/ipython3:1: DeprecationWarning: Changing the shape of non-C contiguous array by
descriptor assignment is deprecated. To maintain
the Fortran contiguity of a multidimensional Fortran
array, use 'a.T.view(...).T' instead
#!/usr/bin/python3
Out[7]:
array([[ 0., 1., 2.],
[ 0., 0., 0.],
[ 4., 5., 6.],
[ 0., 0., 0.]])
In [12]: X_c2.strides
Out[12]: (16, 32)
In [13]: X_c2.flags
Out[13]:
C_CONTIGUOUS : False
F_CONTIGUOUS : True
So this copy (or is it a view?) is in Fortran order. The recommended X_c2.T.view(float).T produces the same 4x3 array without the warning.
As your first view shows, a complex array has the same data layout as twice the number of floats.
I've seen funny shape behavior when trying to view a structured array. I wonder if the complex dtype is behaving much like a dtype('f8,f8') array.
If I change your X_c2 so it is a copy, I get the expected behavior
In [19]: X_c3 = X_c[:,range(3)].copy()
In [20]: X_c3.flags
Out[20]:
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
In [21]: X_c3.strides
Out[21]: (48, 16)
In [22]: X_c3.view(float)
Out[22]:
array([[ 0., 0., 1., 0., 2., 0.],
[ 4., 0., 5., 0., 6., 0.]])
That's reassuring. But I'm puzzled as to why the [:, range(3)] indexing creates an F-order array. That should be advanced indexing.
And indeed, a true slice does not allow this view
In [28]: X_c[:,:3].view(np.float64)
---------------------------------------------------------------------------
ValueError: new type not compatible with array.
So the range indexing has created some sort of hybrid object.
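One way to sidestep the issue (an illustrative addition, not from the original answer) is to force a C-contiguous copy before viewing; the expected column doubling then comes back:

X_r2 = np.ascontiguousarray(X_c2).view(np.float64)
X_r2.shape   # (2, 6)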
I have a numpy array (e.g., a = np.array([ 8., 2.])), and another array which stores the indices I would like to take from the former array (e.g., b = np.array([ 0., 1., 1., 0., 0.])).
What I would like to do is to create another array from these two arrays; in this case, it should be array([ 8., 2., 2., 8., 8.]).
Of course, I can always use a for loop to achieve this goal:

for i in range(5):
    c[i] = a[b[i]]

I wonder if there is a more elegant way to create this array, something like c = a[b[0:5]] (well, this apparently doesn't work).
Only integer arrays can be used for indexing, and you've created b as a float64 array. You can get what you're looking for if you explicitly convert it to an integer array:
bi = np.array(b, dtype=int)
c = a[bi[0:5]]
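Equivalently, astype does the conversion in one step (a small usage sketch; same a and b as above):

c = a[b.astype(int)]
# array([ 8., 2., 2., 8., 8.])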
I can't figure out how to use the CArray trait. Why does this class
from traits.api import HasTraits, CArray, Float, Int
import numpy as np

class Coordinate3D(HasTraits):
    coordinate = CArray(Float(), shape=(1,3))
    def _coordinate_default(self):
        return np.array([1., 2., 3.])
apparently not use my _name_default() method?
In [152]: c=Coordinate3D()
In [153]: c.coordinate
Out[153]: array([[ 0., 0., 0.]])
I would have expected array([ 1., 2., 3.])! The _name_default() seems to work with Int:
class A(HasTraits):
    a = Int
    def _a_default(self):
        return 2
In [163]: a=A()
In [164]: a.a
Out[164]: 2
So what am I doing wrong here? Also, I can't assign values:
In [181]: c.coordinate=[1,2,3]
TraitError: The 'coordinate' trait of a Coordinate3D instance must be an array of float64 values with shape (1, 3), but a value of array([ 1., 2., 3.]) <type 'numpy.ndarray'> was specified.
Same error message with
In [182]: c.coordinate=np.array([1,2,3])
There is a difference between one-dimensional arrays and two-dimensional arrays in which one of the dimensions has size 1. You are trying to set a 1-D array into a CArray trait expecting two dimensions. For example, your default method should be:
def _coordinate_default(self):
    return np.array([[1., 2., 3.]])
(note the extra square brackets). The array you were setting is of shape (3,), not the desired (1, 3).
Similarly, it will not coerce a flat list into a 2-D array. Try assigning a nested list like
c.coordinate=[[1, 2, 3]]
instead.
(Alternatively, if you actually want 1-D arrays, you should use shape=(3,) in your traits assignment and the other parts should work correctly.)
Dummy me. While copy-pasting from Eclipse to IPython, I didn't use the magic %paste function and messed up the class definition there. The other actual error was the shape of the CArray, which must be (3,).
This code
from traits.api import HasTraits, CArray, Float
from numpy import array

class Coordinate3D(HasTraits):
    coordinate = CArray(Float(), shape=(3,))

    def __init__(self, iv=None):
        super(Coordinate3D, self).__init__()
        if iv:
            self.coordinate = iv

    def _coordinate_default(self):
        return array([1, 2, 3])

    def __getitem__(self, index):
        return self.coordinate[index]
works as intended:
In [3]: c=Coordinate3D()
In [6]: c.coordinate
Out[6]: array([ 1., 2., 3.])
In [7]: c=Coordinate3D([1,2,5])
In [8]: c.coordinate
Out[8]: array([ 1., 2., 5.])
In [11]: c[0]
Out[11]: 1.0
As an extension to the previous answers, I experimented further:
import types

# Python 2 type aliases; on Python 3 use (int, float) instead
RealNumberType = (types.IntType, types.LongType, types.FloatType)

class ScaleFactor3D(Coordinate3D):
    '''Demonstrate subclassing a HasTraits class
    and overriding __init__ and a _default method.'''
    def _coordinate_default(self):
        return array([1, 1, 1])
    def __init__(self, iv=None):
        if isinstance(iv, RealNumberType):
            iv = [iv, iv, iv]
        super(ScaleFactor3D, self).__init__(iv)
This works well too:
In [35]: s=ScaleFactor3D()
In [36]: s.coordinate
Out[36]: array([ 1., 1., 1.])
In [37]: s=ScaleFactor3D(3)
In [38]: s.coordinate
Out[38]: array([ 3., 3., 3.])
I thought I'd put this here since I couldn't find much useful information on CArray on the web.