I have to mask/filter a structured numpy array if the rows are present in another structured numpy array.
import numpy as np
dt = {'names':['A', 'B', 'C'],
'formats': [np.int64, np.int64, np.dtype('U8')]}
arr = np.array([
(1, 100, 'ab'),
(2, 800, 'ax'),
(3, 700, 'asb'),
(4, 100, 'ab'),
(5, 500, 'hfg')
], dtype = dt)
dt2 = {'names':['D', 'E', 'F'],
'formats': [np.int64, np.dtype('U8'), np.dtype('U8')]}
arr2 = np.array([
(100, 'ab', 'cff'),
(100, 'cd', 'sdf'),
(500, 'hfg', 'cff'),
(500, 'xx', 'asd')
], dtype = dt2)
print(arr)
print(arr2)
arr3 = arr[np.isin(arr[['B','C']], arr2[['D', 'E']])]
print(arr3)
The element-wise comparison fails in this program and an empty result is returned. The result I'm expecting is:
[(1,100,'ab'), (4, 100, 'ab'), (5, 500, 'hfg')]
The corresponding columns must match. Comparing the columns separately will give wrong results. Is there a way to perform this in numpy? Please don't suggest Pandas or other library based solutions.
The actual error, which you forgot to post (why?):
In [105]: arr[np.isin(arr[['B','C']], arr2[['D', 'E']])]
Traceback (most recent call last):
Input In [105] in <cell line: 1>
arr[np.isin(arr[['B','C']], arr2[['D', 'E']])]
File <__array_function__ internals>:180 in isin
File /usr/local/lib/python3.10/dist-packages/numpy/lib/arraysetops.py:739 in isin
return in1d(element, test_elements, assume_unique=assume_unique,
File <__array_function__ internals>:180 in in1d
File /usr/local/lib/python3.10/dist-packages/numpy/lib/arraysetops.py:612 in in1d
mask |= (ar1 == a)
TypeError: Cannot compare structured arrays unless they have a common dtype. I.e. `np.result_type(arr1, arr2)` must be defined.
isin requires the ability to compare the elements of the arrays. With different dtypes that's impossible.
In [106]: arr[['B','C']].dtype, arr2[['D', 'E']].dtype
Out[106]:
(dtype({'names': ['B', 'C'], 'formats': ['<i8', '<U8'], 'offsets': [8, 16], 'itemsize': 48}),
dtype({'names': ['D', 'E'], 'formats': ['<i8', '<U8'], 'offsets': [0, 8], 'itemsize': 72}))
Just for reference
In [107]: arr.dtype, arr2.dtype
Out[107]:
(dtype([('A', '<i8'), ('B', '<i8'), ('C', '<U8')]),
dtype([('D', '<i8'), ('E', '<U8'), ('F', '<U8')]))
It does work if I repack_fields and rename_fields (with import numpy.lib.recfunctions as rf):
In [114]: np.isin(rf.repack_fields(arr[['B','C']]),
rf.rename_fields(rf.repack_fields(arr2[['D', 'E']]),{'D':'B','E':'C'}))
Out[114]: array([ True, False, False, True, True])
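If you would rather not touch the dtypes at all, here is a minimal sketch of an alternative that is not from the answer above: compare the fields pairwise with broadcasting and keep the rows of arr that match some row of arr2 on both fields at once. The names pair_match and mask are just illustrative, and the intermediate array has shape (len(arr), len(arr2)), so this assumes both arrays are small enough for that to fit in memory.

```python
import numpy as np

dt = {'names': ['A', 'B', 'C'],
      'formats': [np.int64, np.int64, np.dtype('U8')]}
arr = np.array([(1, 100, 'ab'), (2, 800, 'ax'), (3, 700, 'asb'),
                (4, 100, 'ab'), (5, 500, 'hfg')], dtype=dt)
dt2 = {'names': ['D', 'E', 'F'],
       'formats': [np.int64, np.dtype('U8'), np.dtype('U8')]}
arr2 = np.array([(100, 'ab', 'cff'), (100, 'cd', 'sdf'),
                 (500, 'hfg', 'cff'), (500, 'xx', 'asd')], dtype=dt2)

# pair_match[i, j] is True only when row i of arr and row j of arr2
# agree on BOTH fields, so the columns are never matched independently.
pair_match = (arr['B'][:, None] == arr2['D']) & (arr['C'][:, None] == arr2['E'])

# A row of arr is kept if it matches at least one row of arr2.
mask = pair_match.any(axis=1)
arr3 = arr[mask]
print(arr3)
```

Because each (B, C) pair is tested against each (D, E) pair as a unit, a row such as (100, 'cd') in arr2 does not cause a false match on 100 alone.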
I have two numpy arrays and I want to test for equality.
The following works correctly:
# this works
x = np.array([np.array(['a', 'b']), np.array(['c', 'd'])], dtype='object')
y = np.array([np.array(['a', 'b']), np.array(['c', 'd'])], dtype='object')
np.testing.assert_array_equal(x,y)
If one of the internal arrays is ragged however, comparison fails:
# this fails
x = np.array([np.array(['a', 'b']), np.array(['c'])], dtype='object')
y = np.array([np.array(['a', 'b']), np.array(['c'])], dtype='object')
np.testing.assert_array_equal(x,y)
Traceback (most recent call last):
File "/home/.../test.py", line 12, in <module>
np.testing.assert_array_equal(x,y)
File "/home/.../lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 932, in assert_array_equal
assert_array_compare(operator.__eq__, x, y, err_msg=err_msg,
File "/home/.../lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 842, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Arrays are not equal
Mismatched elements: 1 / 1 (100%)
x: array([array(['a', 'b'], dtype='<U1'), array(['c'], dtype='<U1')],
dtype=object)
y: array([array(['a', 'b'], dtype='<U1'), array(['c'], dtype='<U1')],
dtype=object)
UPDATE:
To make the story even more obscure, the following works:
x = np.array([np.array(['a', 'b']), np.array(['c'])], dtype='object')
y = x
np.testing.assert_array_equal(x,y)
Is this the correct behaviour?
In the first case, the arrays are (2,2) (despite the object dtype):
In [20]: x = np.array([np.array(['a', 'b']), np.array(['c', 'd'])], dtype='object')
...: y = np.array([np.array(['a', 'b']), np.array(['c', 'd'])], dtype='object')
In [21]: x
Out[21]:
array([['a', 'b'],
['c', 'd']], dtype=object)
In [22]: x.shape
Out[22]: (2, 2)
In [23]: x==y
Out[23]:
array([[ True, True],
[ True, True]])
The assert just has to verify that all elements of this comparison are True.
The second case:
In [24]: x = np.array([np.array(['a', 'b']), np.array(['c'])], dtype='object')
...: y = np.array([np.array(['a', 'b']), np.array(['c'])], dtype='object')
In [25]: x
Out[25]:
array([array(['a', 'b'], dtype='<U1'), array(['c'], dtype='<U1')],
dtype=object)
In [26]: x.shape
Out[26]: (2,)
In [27]: x==y
<ipython-input-27-051436df861e>:1: DeprecationWarning: elementwise comparison failed;
this will raise an error in the future.
x==y
Out[27]: False
The result is a scalar, not a (2,) array. x==x produces True, with the same warning.
The array elements could be compared pairwise:
In [30]: [i==j for i,j in zip(x,y)]
Out[30]: [array([ True, True]), array([ True])]
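Building on that pairwise idea, a small helper can reduce the comparison to a single True/False. This is only a sketch; ragged_equal is a made-up name, not a numpy function, and it assumes both inputs are 1d object arrays whose elements are themselves arrays.

```python
import numpy as np

def ragged_equal(x, y):
    """Return True if two 1d object arrays hold pairwise-equal sub-arrays."""
    # zip() would silently stop at the shorter input, so check lengths first.
    if len(x) != len(y):
        return False
    # np.array_equal returns False (rather than raising) on shape mismatch.
    return all(np.array_equal(i, j) for i, j in zip(x, y))

x = np.array([np.array(['a', 'b']), np.array(['c'])], dtype=object)
y = np.array([np.array(['a', 'b']), np.array(['c'])], dtype=object)
z = np.array([np.array(['a', 'b']), np.array(['d'])], dtype=object)

print(ragged_equal(x, y))
print(ragged_equal(x, z))
```

In a test you could then write assert ragged_equal(x, y) instead of relying on np.testing.assert_array_equal for the ragged case.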
I have an 'empty' 2D array in numpy as
arr = np.array([[[], [], []], [[], [], []]]).
When I do np.transpose(arr), I get the result: [], instead of the expected:
[[[],[]],[[],[]],[[],[]]].
You get an empty array [] with the right shape. Note that arr itself is also an empty array [].
arr = np.array([[[], [], []], [[], [], []]])
print(arr, arr.shape)
t = arr.T
print(t, t.shape)
[] (2, 3, 0)
[] (0, 3, 2)
Look at what your expression produces:
In [41]: arr = np.array([[[], [], []], [[], [], []]])
In [42]: arr
Out[42]: array([], shape=(2, 3, 0), dtype=float64)
In [43]: print(arr)
[]
In [44]: print(repr(arr))
array([], shape=(2, 3, 0), dtype=float64)
The print shows the str display, while repr is a fuller one that tells us shape and dtype. np.array has followed the [] all the way down, making a 3d array that has float elements. But since the lowest level is created from [], it has a size-0 dimension, and overall the array has 0 elements.
What you want, based on the comment, is a (2,3) array with object dtype. This can hold objects such as lists. But making that with np.array is tricky. A more general approach is to create an array with the right shape and dtype first, and then fill it.
I like to use empty for this, since it fills the object array with None elements. (In a numeric dtype np.empty has other problems, but for object it's nice.)
In [45]: arr = np.empty((2,3), dtype=object)
In [46]: arr
Out[46]:
array([[None, None, None],
[None, None, None]], dtype=object)
But trying to assign a list to elements of such an array can be tricky:
In [47]: arr[:]=[]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-47-b5ed8d639464> in <module>
----> 1 arr[:]=[]
ValueError: could not broadcast input array from shape (0) into shape (2,3)
In [48]: np.full((2,3),[])
...
ValueError: could not broadcast input array from shape (0) into shape (2,3)
full has the same problem. In addition, assignment like this, if it worked, would put the same list in each slot, the equivalent of making a list with [[]]*3. We want a new [] in each slot.
We can do this in the 2d arr, but the iteration is simpler with a 1d (which can be reshaped later):
In [49]: arr = np.empty(6, object)
In [50]: arr
Out[50]: array([None, None, None, None, None, None], dtype=object)
In [51]: for i in range(6): arr[i]=[]
In [52]: arr
Out[52]:
array([list([]), list([]), list([]), list([]), list([]), list([])],
dtype=object)
In [53]: arr = np.reshape(arr, (2,3))
In [54]: arr
Out[54]:
array([[list([]), list([]), list([])],
[list([]), list([]), list([])]], dtype=object)
Obviously we could transpose that, but could just as well use (3,2) in the reshape.
Note that this display of arr clearly shows that it contains list objects.
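For completeness, here is a sketch of the 2d version of the same fill, using np.ndindex to walk every index of the array so that no reshape is needed. The variable names are only illustrative.

```python
import numpy as np

arr = np.empty((2, 3), dtype=object)

# Assign a fresh list to every slot; full integer indexing stores the
# list object itself instead of trying to broadcast its contents.
for idx in np.ndindex(arr.shape):
    arr[idx] = []

t = arr.T  # the transpose is a (3, 2) view holding the same list objects
arr[0, 1].append('x')
print(arr)
print(t.shape)
```

Since each slot gets its own list, appending to arr[0, 1] leaves the other slots empty, and the change is visible through the transposed view as t[1, 0].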
But do you really need such an array? Is it worth the extra work?
nd2values[:,[1]]=nd2values[:,[1]].astype(int)
nd2values
outputs
array([['021fd159b55773fba8157e2090fe0fe2', '1',
'881f83d2dee3f18c7d1751659406144e',
'012059d397c0b7e5a30a5bb89c0b075e', 'A'],
['021fd159b55773fba8157e2090fe0fe2', '1',
'cec898a1d355dbfbad8c760615fde1af',
'012059d397c0b7e5a30a5bb89c0b075e', 'A'],
['021fd159b55773fba8157e2090fe0fe2', '1',
'a99f44bbff39e352191a870e17f04537',
'012059d397c0b7e5a30a5bb89c0b075e', 'A'],
...,
['fdeb2950c4d5209d449ebd2d6afac11e', '4',
'4f4e47023263931e1445dc97f7dae941',
'3cd0b15957ceb80f5125bef8bd1bbea7', 'A'],
['fdeb2950c4d5209d449ebd2d6afac11e', '4',
'021dabc5d7a1404ec8ad34fe8ca4b5e3',
'3cd0b15957ceb80f5125bef8bd1bbea7', 'A'],
['fdeb2950c4d5209d449ebd2d6afac11e', '4',
'f79a2b5e6190ac3c534645e806f1b611',
'3cd0b15957ceb80f5125bef8bd1bbea7', 'A']], dtype='<U32')
The data type of the second column is still str. Is it because this particular numpy array has a dtype restriction? How would you change the second column to int? Thanks.
np.array(nd2values,dtype=[str,int,str,str,str])
gives
TypeError: data type not understood
A structured array alternative:
A copy-n-paste from the question gives me a (6,5) array with U32 dtype:
In [96]: arr.shape
Out[96]: (6, 5)
Define a compound dtype:
In [99]: dt = np.dtype([('f0','U32'),('f1',int),('f2','U32'),('f3','U32'),('f4','U1')])
Input to a structured array should be a list of tuples:
In [100]: arrS = np.array([tuple(x) for x in arr], dt)
In [101]: arrS
Out[101]:
array([('021fd159b55773fba8157e2090fe0fe2', 1, '881f83d2dee3f18c7d1751659406144e', '012059d397c0b7e5a30a5bb89c0b075e', 'A'),
('021fd159b55773fba8157e2090fe0fe2', 1, 'cec898a1d355dbfbad8c760615fde1af', '012059d397c0b7e5a30a5bb89c0b075e', 'A'),
('021fd159b55773fba8157e2090fe0fe2', 1, 'a99f44bbff39e352191a870e17f04537', '012059d397c0b7e5a30a5bb89c0b075e', 'A'),
('fdeb2950c4d5209d449ebd2d6afac11e', 4, '4f4e47023263931e1445dc97f7dae941', '3cd0b15957ceb80f5125bef8bd1bbea7', 'A'),
('fdeb2950c4d5209d449ebd2d6afac11e', 4, '021dabc5d7a1404ec8ad34fe8ca4b5e3', '3cd0b15957ceb80f5125bef8bd1bbea7', 'A'),
('fdeb2950c4d5209d449ebd2d6afac11e', 4, 'f79a2b5e6190ac3c534645e806f1b611', '3cd0b15957ceb80f5125bef8bd1bbea7', 'A')],
dtype=[('f0', '<U32'), ('f1', '<i8'), ('f2', '<U32'), ('f3', '<U32'), ('f4', '<U1')])
One field can be accessed by name:
In [102]: arrS['f1']
Out[102]: array([1, 1, 1, 4, 4, 4])
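As an alternative to building a list of tuples, the structured array can also be filled one field at a time. The following is a sketch under the assumption that the source is a 2d string array like the one in the question; the two-row arr here is shortened stand-in data, not the original values.

```python
import numpy as np

# Shortened stand-in for the (n, 5) string array from the question.
arr = np.array([['aa', '1', 'bb', 'cc', 'A'],
                ['dd', '4', 'ee', 'ff', 'A']], dtype='U32')

dt = np.dtype([('f0', 'U32'), ('f1', int), ('f2', 'U32'),
               ('f3', 'U32'), ('f4', 'U1')])

# Allocate one record per row, then copy each column into its field,
# converting explicitly to that field's dtype.
arrS = np.empty(arr.shape[0], dtype=dt)
for col, name in enumerate(dt.names):
    arrS[name] = arr[:, col].astype(dt[name])

print(arrS)
print(arrS['f1'])
```

The explicit astype makes the string-to-int conversion of the second column visible, rather than leaving it to the assignment.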
The assignment is casting your ints to the dtype of the array. To be able to hold all kinds of objects in an array, set the dtype to object.
nd2values = nd2values.astype(object)
then
nd2values[:,[1]]=nd2values[:,[1]].astype(int)
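A small self-contained sketch of the object-dtype approach, using made-up stand-in data in place of the real nd2values:

```python
import numpy as np

# Stand-in for the string array from the question.
nd2values = np.array([['x', '1', 'A'],
                      ['y', '4', 'A']])

# With a '<U..' dtype the ints would be cast straight back to str on
# assignment, so switch to object dtype first.
nd2values = nd2values.astype(object)
nd2values[:, 1] = nd2values[:, 1].astype(int)

print(nd2values)
print(nd2values[:, 1].sum())
```

Keep in mind that an object array stores references to Python objects, so it gives up most of the speed and memory advantages of a numeric array.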
I have two different arrays, one with strings and another with ints. I want to concatenate them, into one array where each column has the original datatype. My current solution for doing this (see below) converts the entire array into dtype = string, which seems very memory inefficient.
combined_array = np.concatenate((A, B), axis = 1)
Is it possible to have multiple dtypes in combined_array when A.dtype = string and B.dtype = int?
One approach might be to use a record array. The "columns" won't be like the columns of standard numpy arrays, but for most use cases, this is sufficient:
>>> a = numpy.array(['a', 'b', 'c', 'd', 'e'])
>>> b = numpy.arange(5)
>>> records = numpy.rec.fromarrays((a, b), names=('keys', 'data'))
>>> records
rec.array([('a', 0), ('b', 1), ('c', 2), ('d', 3), ('e', 4)],
dtype=[('keys', '|S1'), ('data', '<i8')])
>>> records['keys']
rec.array(['a', 'b', 'c', 'd', 'e'],
dtype='|S1')
>>> records['data']
array([0, 1, 2, 3, 4])
Note that you can also do something similar with a standard array by specifying the datatype of the array. This is known as a "structured array":
>>> arr = numpy.array([('a', 0), ('b', 1)],
dtype=([('keys', '|S1'), ('data', 'i8')]))
>>> arr
array([('a', 0), ('b', 1)],
dtype=[('keys', '|S1'), ('data', '<i8')])
The difference is that record arrays also allow attribute access to individual data fields. Standard structured arrays do not.
>>> records.keys
chararray(['a', 'b', 'c', 'd', 'e'],
dtype='|S1')
>>> arr.keys
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'numpy.ndarray' object has no attribute 'keys'
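If you already have a structured array and want attribute access, one option is to view it as a record array. A minimal sketch follows; the field names key and val are chosen here for illustration (a field called data would be shadowed by the ndarray.data attribute when accessed as an attribute).

```python
import numpy as np

arr = np.array([('a', 0), ('b', 1)],
               dtype=[('key', 'U1'), ('val', 'i8')])

# Viewing as np.recarray adds attribute access without copying the data.
rec = arr.view(np.recarray)

print(rec.key)
print(rec.val)
print(np.shares_memory(rec, arr))
```

Indexing by name, as in rec['val'], keeps working on the record array as well.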
A simple solution: convert your data to object 'O' type
z = np.zeros((2,2), dtype='U2')
o = np.ones((2,1), dtype='O')
np.hstack([o, z])
creates the array:
array([[1, '', ''],
[1, '', '']], dtype=object)
According to the NumPy documentation, the function numpy.lib.recfunctions.merge_arrays can be used to merge numpy arrays of different data types into either a structured array or a record array.
Example:
>>> from numpy.lib import recfunctions as rfn
>>> A = np.array([1, 2, 3])
>>> B = np.array(['a', 'b', 'c'])
>>> b = rfn.merge_arrays((A, B))
>>> b
array([(1, 'a'), (2, 'b'), (3, 'c')], dtype=[('f0', '<i4'), ('f1', '<U1')])
For more detail, please refer to the NumPy documentation for numpy.lib.recfunctions.
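As a follow-up sketch, the default field names f0 and f1 can be replaced with rename_fields from the same module, and each field then reads back with its own dtype. The names num and label are just examples.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

A = np.array([1, 2, 3])
B = np.array(['a', 'b', 'c'])

# merge_arrays names the fields f0, f1, ... for plain 1d inputs.
merged = rfn.merge_arrays((A, B))

# Give the fields more descriptive (illustrative) names.
merged = rfn.rename_fields(merged, {'f0': 'num', 'f1': 'label'})

print(merged.dtype.names)
print(merged['num'])
print(merged['label'])
```

Unlike concatenating into a single string array, the integer field stays an integer field.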