Compare two ndarrays with different dimensions - python

I have two ndarrays. First ndarray has string in one column and float values in another column. Second ndarray contains only a column of string values.
For example:
Array1               Array2
"abc"      1.000     "abc"
"fsfds"   -5.000     "qw"
"svs"      2.094     "svs"
"dfdsge"   3.348     "dd"
My question is, how can I compare matching string values from Array1 and Array2 then return corresponding float values from Array1?
I tried set(Array1) & set(Array2) to find the common elements, but I don't know how to extract the float values. Is there a function in numpy for this?
Thank you.

The easiest way to turn your example into arrays is to copy-n-paste it as a multiline string and use genfromtxt to parse it:
In [344]: txt=b'''"abc" 1.000 "abc"
...: "fsfds" -5.000 "qw"
...: "svs" 2.094 "svs"
...: "dfdsge" 3.348 "dd" '''
In [346]: np.genfromtxt(txt.splitlines(),dtype=None)
Out[346]:
array([(b'"abc"', 1. , b'"abc"'), (b'"fsfds"', -5. , b'"qw"'),
(b'"svs"', 2.094, b'"svs"'), (b'"dfdsge"', 3.348, b'"dd"')],
dtype=[('f0', 'S8'), ('f1', '<f8'), ('f2', 'S5')])
With dtype=None it deduces column dtype, and creates a structured array. I can split that into 2 arrays, one with 2 fields, the other with 1. These are all 1d.
In [347]: arr1, arr2 = _[['f0','f1']], _['f2']
In [348]: arr1
Out[348]:
array([(b'"abc"', 1. ), (b'"fsfds"', -5. ), (b'"svs"', 2.094),
(b'"dfdsge"', 3.348)],
dtype=[('f0', 'S8'), ('f1', '<f8')])
In [349]: arr2
Out[349]:
array([b'"abc"', b'"qw"', b'"svs"', b'"dd"'],
dtype='|S5')
You are a little unclear about how you want to compare the text columns. An easy approach that looks reasonable with this data is an element-by-element comparison with a simple ==.
In [350]: arr1['f0']==arr2
Out[350]: array([ True, False, True, False], dtype=bool)
With this boolean mask I can easily select the elements of arr1:
In [351]: arr1[_]
Out[351]:
array([(b'"abc"', 1. ), (b'"svs"', 2.094)],
dtype=[('f0', 'S8'), ('f1', '<f8')])
Let's see if I can turn these into object arrays.
In [372]: array1 = np.array(arr1.tolist(),dtype=object)
In [373]: array2 = np.array(arr2.tolist(),dtype=object)
In [374]: array1
Out[374]:
array([[b'"abc"', 1.0],
[b'"fsfds"', -5.0],
[b'"svs"', 2.094],
[b'"dfdsge"', 3.348]], dtype=object)
In [375]: array2
Out[375]: array([b'"abc"', b'"qw"', b'"svs"', b'"dd"'], dtype=object)
We can get the same mask:
In [376]: array1[:,0]==array2
Out[376]: array([ True, False, True, False], dtype=bool)
In [377]: array1[_,:]
Out[377]:
array([[b'"abc"', 1.0],
[b'"svs"', 2.094]], dtype=object)
Another way to get a mask:
In [378]: np.in1d(array2,array1[:,0])
Out[378]: array([ True, False, True, False], dtype=bool)
In this case it produces the same result.
Actually, to find the rows of array1 whose strings occur anywhere in array2 (in any order), the arguments have to be swapped:
In [389]: np.in1d(array1[:,0],array2[[1,0,3,2]])
Out[389]: array([ True, False, True, False], dtype=bool)
Look at in1d and the related array set functions for more ideas and details.
In any case, use field or column selection to get the 1d array of strings that can be compared to the strings in the other array.
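A minimal sketch of the complete lookup, assuming the strings and floats are held in two plain 1d arrays (the names labels1, values1 and labels2 are illustrative, not from the question). It uses np.isin, which tests membership regardless of position:

```python
import numpy as np

# Hypothetical stand-ins for the question's data.
labels1 = np.array(["abc", "fsfds", "svs", "dfdsge"])
values1 = np.array([1.000, -5.000, 2.094, 3.348])
labels2 = np.array(["svs", "qw", "abc", "dd"])  # deliberately in a different order

# True wherever a label from the first array occurs anywhere in the second.
mask = np.isin(labels1, labels2)

# The same mask selects the corresponding floats.
matched_values = values1[mask]
print(mask)
print(matched_values)
```

The mask has the shape of labels1, so it can index any array that is aligned with it row by row.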

You can use array comparison as your index for the first dimension to select the rows you want. I'm not sure exactly how you have an ndarray containing both strings and floats, but here's an example where we set it so the first and last rows have the same value in the first column.
import numpy as np
array_1 = np.random.randn(4, 2)
array_2 = np.random.randn(4)
array_2[3] = array_1[3, 0]
array_2[0] = array_1[0, 0]
print(array_1, array_2)
print(array_1[array_1[:, 0] == array_2, 1])
This gives
[[ 0.76170733 -1.40708366]
[-1.42535617 -1.03982291]
[ 0.67999753 -0.92733875]
[ 0.96474552 -1.95639871]]
[ 0.76170733 0.95046454 0.1548689 0.96474552]
[-1.40708366 -1.95639871]

I think that list comprehension can do the trick here:
Output=[i[1] for i in Array1 if i[0] in Array2]
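A runnable sketch of that comprehension, assuming Array1 is a 2-column object array and Array2 is a 1d array of strings (the sample values come from the question; the helper name wanted is made up for this example). Converting Array2 to a set first is an optional tweak that avoids scanning the array for every row:

```python
import numpy as np

# Illustrative object array shaped like the question's Array1: (string, float) rows.
Array1 = np.array([["abc", 1.000], ["fsfds", -5.000],
                   ["svs", 2.094], ["dfdsge", 3.348]], dtype=object)
Array2 = np.array(["abc", "qw", "svs", "dd"], dtype=object)

# Set membership is O(1) on average, unlike `in` on an array.
wanted = set(Array2.tolist())

# Keep the float from every row whose string occurs in Array2.
Output = [row[1] for row in Array1 if row[0] in wanted]
print(Output)
```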

Related

Get boolean array indicating which elements in array which belong to a list

This seems to be a simple question, but I have been struggling with errors for quite some time.
Imagine an array
a = np.array([2,3,4,5,6])
I want to test which elements in the array belong to another list
[2,3,6]
If I do
a in [2,3,6]
Python raises "ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"
In return, I would like to get a boolean array like
array([ True, True, False, False, True], dtype=bool)
Use np.isin to create a boolean mask, then use np.argwhere on this mask to find the indices of the elements that are True:
lst = [2, 3, 6]
m = np.isin(a, lst)
indices = np.argwhere(m)
# print(m)
array([ True, True, False, False, True])
# print(indices)
array([[0], [1], [4]])
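A self-contained variant of the same idea, with lst taken to be the list from the question. np.flatnonzero is swapped in for np.argwhere here because it returns a flat 1d array of positions rather than an (N, 1) column; which one is preferable depends on how the indices are used afterwards:

```python
import numpy as np

a = np.array([2, 3, 4, 5, 6])
lst = [2, 3, 6]  # assumed to be the list from the question

m = np.isin(a, lst)          # boolean mask with the same shape as a
indices = np.flatnonzero(m)  # 1d positions of the True entries
selected = a[m]              # the matching elements themselves
print(m, indices, selected)
```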
import numpy as np
arr1 = np.array([2,3,4,5,6])
arr2 = np.array([2,3,6])
arr_result = [bool(a1 in arr2) for a1 in arr1]
print(arr_result)
I have used simple list-comprehension logic to do this.
Output:
[True, True, False, False, True]

python numpy - unable to compare 2 arrays

I have the 2 arrays as follows:
x = array(['2019-02-28', '2019-03-01'], dtype=object)
z = array(['2019-02-28', '2019-03-02', '2019-03-01'], dtype=object)
I'm trying to use np.where to find the indices at which the two arrays match.
I'm doing
i = np.where(z == x)
but it doesn't work: I get an empty array as a result. It looks like it is checking whether one whole array equals the other, whereas I'm looking for the matching values between the two. How should I do it?
Thanks
Regards
Edit: the expected outcome is [True, False, False]
The where result is only as good as the boolean it searches. If the argument does not have any True values, where returns empty:
In [308]: x = np.array(['2019-02-28', '2019-03-01'], dtype=object)
...: z = np.array(['2019-02-28', '2019-03-02', '2019-03-01'], dtype=object)
In [309]: x==z
/usr/local/bin/ipython3:1: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
#!/usr/bin/python3
Out[309]: False
If you aren't concerned about order:
In [311]: np.isin(z,x)
Out[311]: array([ True, False, True])
or trimming z:
In [312]: x==z[:2]
Out[312]: array([ True, False])
To extend x, you could first use np.pad, or use itertools.zip_longest:
In [353]: list(itertools.zip_longest(x,z))
Out[353]:
[('2019-02-28', '2019-02-28'),
('2019-03-01', '2019-03-02'),
(None, '2019-03-01')]
In [354]: [i==j for i,j in itertools.zip_longest(x,z)]
Out[354]: [True, False, False]
zip_longest accepts other fill values if that makes the comparison better.
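As a sketch of the fill-value remark, assuming an empty string is an acceptable placeholder for the missing entry (that choice is an assumption, not something the question specifies):

```python
import itertools
import numpy as np

x = np.array(['2019-02-28', '2019-03-01'], dtype=object)
z = np.array(['2019-02-28', '2019-03-02', '2019-03-01'], dtype=object)

# fillvalue replaces the default None used to pad the shorter input.
pairs = list(itertools.zip_longest(x, z, fillvalue=''))

# Compare pairwise and collect the result as a boolean array.
mask = np.array([i == j for i, j in pairs])
print(pairs)
print(mask)
```

A string fill value keeps every pair the same type, which can matter if the comparison is later replaced by something that does not accept None.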
Is this what you need?
print([i for i, (a, b) in enumerate(zip(x, z)) if a == b])
Because the two arrays have different sizes, zip compares only over the shorter of the two.
Edit:
I just reread the question and comments.
result= np.zeros( max(x.size, z.size), dtype=bool) # result size of the biggest array.
size = min(x.size, z.size)
result[:size] = z[:size] == x[:size] # Comparison at smallest size.
result
# array([ True, False, False])
This gives the boolean mask the comment asks for.
Original answer
import numpy as np
x = np.array(['2019-02-28', '2019-03-01'], dtype=object)
z = np.array(['2019-02-28', '2019-03-02', '2019-03-01'], dtype=object)
size = min(x.size, z.size)
np.where(z[:size]==x[:size]) # Select the common range
# (array([0], dtype=int64),)
On my machine this is slower than the list comprehension from @U10-Forward for dtype=object, but faster if numpy selects the dtype itself (10-character Unicode, '<U10'):
x = np.array(['2019-02-28', '2019-03-01'])
z = np.array(['2019-02-28', '2019-03-02', '2019-03-01'])

What is going on behind this numpy selection behavior?

While answering this question, some others and I wrongly assumed that the following would work:
Say one has
test = [ [ [0], 1 ],
[ [1], 1 ]
]
import numpy as np
nptest = np.array(test)
What is the reason behind
>>> nptest[:,0]==[1]
array([False, False], dtype=bool)
while one has
>>> nptest[0,0]==[1],nptest[1,0]==[1]
(False, True)
or
>>> nptest==[1]
array([[False, True],
[False, True]], dtype=bool)
or
>>> nptest==1
array([[False, True],
[False, True]], dtype=bool)
Is it the degeneracy in terms of dimensions that causes this?
nptest is a 2D array of object dtype, and the first element of each row is a list.
nptest[:, 0] is a 1D array of object dtype, each of whose elements is a list.
When you do nptest[:,0]==[1], NumPy does not perform an elementwise comparison of each element of nptest[:,0] against the list [1]. It creates as high-dimensional an array as it can from [1], producing the 1D array np.array([1]), and then broadcasts the comparison, comparing each element of nptest[:,0] against the integer 1.
Since no list in nptest[:, 0] is equal to 1, all elements of the result are False.
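The difference can be reproduced with a short sketch. It passes dtype=object explicitly, which is an addition to the question's code: newer NumPy releases raise an error for ragged nested input without it. The names first_col, broadcast_result and per_element are illustrative:

```python
import numpy as np

test = [[[0], 1],
        [[1], 1]]
# dtype=object is spelled out; this is an assumption added for newer NumPy.
nptest = np.array(test, dtype=object)

first_col = nptest[:, 0]  # 1d object array holding the lists [0] and [1]

# [1] becomes np.array([1]) and is broadcast, so each stored list
# is compared with the integer 1, never with the list [1].
broadcast_result = first_col == [1]

# Comparing each stored list with the list [1] needs an explicit loop.
per_element = np.array([item == [1] for item in first_col])
print(broadcast_result)
print(per_element)
```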

Numpy Chain Indexing

I am trying to gain a better understanding of numpy and have come across something I can't quite understand when it comes to indexing.
Let's say we have this first array of random booleans
bools = np.random.choice([True, False],(7),p=[0.5,0.5])
array([False, True, False, False, True, False, False], dtype=bool)
Then let's also say we have this second array of random numbers selected from a normal distribution
data = np.random.randn(7,3)
array([[ 2.24116809, -0.41761776, -0.69026077],
[-0.85450123, 0.98218741, 0.0233551 ],
[-1.3157436 , -0.79753471, 1.77393444],
[-0.26672724, -0.9532758 , 0.67114247],
[-1.34177843, 1.220083 , -0.35341168],
[ 0.49629327, 1.73943962, 0.59050431],
[ 0.01609382, 0.91396293, 0.3754827 ]])
Using the numpy chain indexing I can do this
data[bools, 2:]
array([[ 0.0233551 ],
[-0.35341168]])
Now let's say I want to simply grab the first element, I can do this
data[bools, 2:][0]
array([ 0.0233551])
But why does this, data[bools, 2:, 0] not work?
Because the input is a 2D array and as such you don't have three dimensions there to use something like : [bools, 2:, 0].
To achieve what you are trying to do, you could store the indices corresponding to the True entries in the mask bools, and then index with all of them or with a single one.
A sample run to make things clear -
Inputs :
In [40]: data
Out[40]:
array([[ 1.02429045, 1.74104271, -0.54634826],
[-0.48451969, 0.83455196, 1.94444857],
[ 0.66504345, 0.41821317, 2.52517305],
[ 2.11428982, -0.05769528, 0.84432614],
[ 0.9251009 , -0.74646199, -0.93573164],
[ 0.07321257, -0.10708067, 1.78107884],
[-0.12961046, -0.5787856 , 0.2189466 ]])
In [41]: bools
Out[41]: array([ True, True, False, False, False, False, True], dtype=bool)
Store the valid indices :
In [42]: idx = np.flatnonzero(bools)
In [43]: idx
Out[43]: array([0, 1, 6])
Use as a whole or its first element :
In [44]: data[idx, 2:] # Same as data[bools, 2:]
Out[44]:
array([[-0.54634826],
[ 1.94444857],
[ 0.2189466 ]])
In [45]: data[idx[0], 2:]
Out[45]: array([-0.54634826])
I haven't seen 2d numpy indexing called 'chaining'.
data is 2d, and thus can be indexed with a 2-element tuple:
data[bools, 2:]
data[(bools, slice(2,None,None))]
That can also be expressed as
data[bools,:][:,2:]
where it first selects from rows, and then from columns.
Notice that your indexing produces a (2,1) array: 2 from the number of True values in bools, and 1 from the length of the 2: slice.
Your 2nd indexing with [0] is really a row selection:
data[bools, 2:][0]
data[bools, 2:][0,:]
The result is a (1,) array, the size of the 2nd dimension of the intermediate array.
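The shapes involved can be checked with a small sketch. The random data is generated locally and the mask is the one printed in the question; the names sub, first_row, scalar and third_index_fails are made up for this example:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((7, 3))
bools = np.array([False, True, False, False, True, False, False])

sub = data[bools, 2:]       # shape (2, 1): two True rows, one column
first_row = sub[0]          # shape (1,), the same as sub[0, :]
scalar = data[bools, 2][0]  # an integer column index drops that axis

# data is 2d, so supplying a third index is one too many.
third_index_fails = False
try:
    data[bools, 2:, 0]
except IndexError as exc:
    third_index_fails = True
    print("IndexError:", exc)

print(sub.shape, first_row.shape, scalar)
```

If the goal is a single number rather than a length-1 array, indexing the column with the integer 2 instead of the slice 2: is one way to get there.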

boolean indexing from a subset of a list in python

I have an array of names, along with a corresponding array of data. From the array of names, there is also a smaller subset of names:
data = np.array([75., 49., 80., 87., 99.])
arr1 = np.array(['Bob', 'Joe', 'Mary', 'Ellen', 'Dick'], dtype='|S5')
arr2 = np.array(['Mary', 'Dick'], dtype='|S5')
I am trying to make a new array of data corresponding only to the names that appear in arr2. This is what I have been able to come up with on my own:
TF = []
for i in arr1:
    if i in arr2:
        TF.append(True)
    else:
        TF.append(False)
new_data = data[TF]
Is there a more efficient way of doing this that doesn't involve a for loop? I should mention that the arrays themselves are being input from an external file, and there are actually multiple arrays of data, so I can't really change anything about that.
You can use numpy.in1d, which tests whether each element in one array is also present in the second array.
Demo
>>> new_data = data[np.in1d(arr1, arr2)]
>>> new_data
array([ 80., 99.])
in1d returns an ndarray of bools, which is analogous to the list you constructed in your original code:
>>> np.in1d(arr1, arr2)
array([False, False, True, False, True], dtype=bool)
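As far as I know, np.in1d has been deprecated in newer NumPy releases in favor of np.isin, which gives the same mask for 1d input. A sketch of the same selection with np.isin, using the arrays from the question:

```python
import numpy as np

data = np.array([75., 49., 80., 87., 99.])
arr1 = np.array(['Bob', 'Joe', 'Mary', 'Ellen', 'Dick'], dtype='|S5')
arr2 = np.array(['Mary', 'Dick'], dtype='|S5')

# True for every name in arr1 that also appears in arr2.
mask = np.isin(arr1, arr2)

# The mask can be reused for each of the aligned data arrays.
new_data = data[mask]
print(mask)
print(new_data)
```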
