Boolean indexing from a subset of a list in Python

I have an array of names, along with a corresponding array of data. From the array of names, there is also a smaller subset of names:
data = np.array([75., 49., 80., 87., 99.])
arr1 = np.array(['Bob', 'Joe', 'Mary', 'Ellen', 'Dick'], dtype='|S5')
arr2 = np.array(['Mary', 'Dick'], dtype='|S5')
I am trying to make a new array of data corresponding only to the names that appear in arr2. This is what I have been able to come up with on my own:
TF = []
for i in arr1:
    if i in arr2:
        TF.append(True)
    else:
        TF.append(False)
new_data = data[TF]
Is there a more efficient way of doing this that doesn't involve a for loop? I should mention that the arrays themselves are being input from an external file, and there are actually multiple arrays of data, so I can't really change anything about that.

You can use numpy.in1d, which tests whether each element in one array is also present in the second array.
Demo
>>> new_data = data[np.in1d(arr1, arr2)]
>>> new_data
array([ 80., 99.])
in1d returns an ndarray of bools, which is analogous to the list you constructed in your original code:
>>> np.in1d(arr1, arr2)
array([False, False, True, False, True], dtype=bool)
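On newer NumPy releases the same mask can be built with np.isin, which is the spelling the NumPy docs recommend for new code (np.in1d is deprecated as of NumPy 2.0). A minimal sketch, assuming the arrays from the question but with default string dtypes rather than '|S5':

```python
import numpy as np

data = np.array([75., 49., 80., 87., 99.])
arr1 = np.array(['Bob', 'Joe', 'Mary', 'Ellen', 'Dick'])
arr2 = np.array(['Mary', 'Dick'])

# True wherever the name in arr1 also appears in arr2
mask = np.isin(arr1, arr2)

# boolean indexing keeps only the matching entries of data
new_data = data[mask]
print(mask)
print(new_data)
```

The same mask can be reused for every other data array read from the file, as long as each one lines up with arr1.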

Related

Numpy Append Data to an empty array

I'm trying to write a code that will add 2 arrays(element by element) and store them in a 3rd array.
Basic Logic:
arr3[i] = arr1[i] + arr2[i]
For this, I have created two arrays arr1 and arr2. The result of the sum of arr1 and arr2 is getting appended in an empty array arr3.
code:
from numpy import append, array, int8
arr1 = array([1,2,3,4,5])
arr2 = array([2,4,6,8,10])
len = max(arr1.size,arr2.size)
arr3 = array([],dtype=int8)
for i in range(len):
    append(arr3,arr1[i]+arr2[i])
    print(arr1[i]+arr2[i])
    print(arr3[i])
print(arr3)
In this code, I'm able to refer to elements of arr1 and arr2 and add them, but I'm not able to append the data to arr3.
Can anyone please help me to understand, what is the mistake in the code due to which I'm not able to store the data to arr3?
You can simply use
arr3 = arr1 + arr2
The reason why your code doesn't work is that append doesn't mutate the array, but returns a new one. You can simply modify your code like this:
for i in range(len):
    arr3 = append(arr3,arr1[i]+arr2[i])
This could give indexing errors:
max(arr1.size,arr2.size)
If the arrays differ in size, a range over this value produces index values that are too large for the smaller array.
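To make that failure concrete, here is a small sketch with two arrays of different sizes. The names short and long_ are made up for illustration and are not from the question:

```python
import numpy as np

short = np.array([1, 2, 3])          # hypothetical shorter input
long_ = np.array([2, 4, 6, 8, 10])   # hypothetical longer input

n = max(short.size, long_.size)      # 5, which is too large for short
try:
    for i in range(n):
        short[i] + long_[i]          # fails once i reaches 3
    raised = False
except IndexError:
    raised = True
print(raised)

# zip stops at the end of the shorter array, so no index can overflow
pairs = [int(x + y) for x, y in zip(short, long_)]
print(pairs)
```

Using min instead of max would also avoid the error, but zip expresses the same intent without any manual indexing.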
The straightforward way of summing the 2 arrays is
In [79]: arr1 = np.array([1,2,3,4,5])
...: arr2 = np.array([2,4,6,8,10])
In [80]: arr1+arr2
Out[80]: array([ 3, 6, 9, 12, 15])
It makes optimal use of the numpy array definitions, and is fastest.
If you must iterate (for example because you want to learn from your mistakes), use something like (which is actually better if the inputs are lists, not arrays):
In [86]: alist = []
    ...: for x,y in zip(arr1,arr2):
    ...:     alist.append(x+y)
    ...:
In [87]: alist
Out[87]: [3, 6, 9, 12, 15]
or better yet as a list comprehension
In [88]: [x+y for x,y in zip(arr1,arr2)]
Out[88]: [3, 6, 9, 12, 15]
I'm using zip instead of the arr1[i] types of range indexing. It's more concise, and less likely to produce errors.
np.append, despite the name, is not a list append clone. Read, if necessary reread, the np.append docs.
append : ndarray
A copy of `arr` with `values` appended to `axis`. Note that
`append` does not occur in-place: a new array is allocated and
filled. If `axis` is None, `out` is a flattened array.
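A quick way to see what that docstring means in practice (a minimal sketch; the value 7 and the name result are just for illustration):

```python
import numpy as np

arr3 = np.array([], dtype=np.int64)

# np.append builds and returns a new array; arr3 itself is not modified
result = np.append(arr3, 7)

print(arr3.size)   # the original array is still empty
print(result)      # the returned array holds the appended value
```

Discarding the return value, as the question's loop does, therefore leaves arr3 empty.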
This does work, but is slower:
In [90]: arr3 = np.array([])
    ...: for x,y in zip(arr1,arr2):
    ...:     arr3 = np.append(arr3,x+y)
    ...:
In [91]: arr3
Out[91]: array([ 3., 6., 9., 12., 15.])
I would like to remove np.append, since it misleads far too many beginners.
Iteration like this is great for lists, but best avoided when working with numpy arrays. Learn to use the defined numpy operators and methods, and use elementwise iteration only as last resort.
First things first
Do not use a built-in function name as a variable.
len is a built-in function in Python.
@sagi's answer is right. Writing the for loop would mean your code is not time-optimized.
But if you still want to understand where your code went wrong, check array shape
import numpy as np
arr3 = np.array([],dtype=np.int8)
print (arr3.shape)
>>> (0,)
Maybe you can create an empty array of the same shape as arr1 or arr2. Seems like for your problem they have same dimension.
arr3 = np.empty(arr1.shape, dtype=arr1.dtype)
arr3[:] = arr1 + arr2
If you still insist on using the dreaded for loop and append, then use this:
list3 = []
for x, y in zip(arr1, arr2):
    list3.append(x+y)
arr3 = np.asarray(list3)
print(arr3)
>>> [ 3  6  9 12 15]
Cheers, good luck!!

How does numpy array typing interact with object?

I am currently trying to implement a datatype that stores floats in a numpy array. However, creating an array whose elements of this type have different lengths breaks the code: one would be assigning a sequence to an array element, which is not possible.
One can bypass this by using the data type object instead of float. Why is that? How could one resolve this problem using floats without creating a sequence?
Example code that does not work.
from numpy import *
foo= dtype(float32, [])
x = array([[2., 3.], [3.]], dtype=foo)
Example code that does work:
from numpy import *
foo= dtype(float32, [])
x = array([[2., 3.], [3., 2.]], dtype=foo)
Example code that does work, I try to replicate for float:
from numpy import *
foo= dtype(object, [])
x = array([[2., 3.], [3.]], dtype=foo)
The object dtype in Numpy simply creates an array of pointers to Python objects. This means you lose the performance advantage you usually get from Numpy, but it's still sometimes useful to do this.
Your last example creates a one-dimensional Numpy array of length two, so that's two pointers to Python objects. Both these objects happen to be lists, and Python lists have arbitrary dynamic length.
I don't know what you were trying to achieve with this, but note that
>>> np.dtype(np.float32, []) == np.float32
True
Arrays require the same number of elements for each row. So, if you feed a list of lists in numpy and all sublists have the same number of elements, it'll happily convert it to an array. This is why your second example works.
If the sublists are not the same length, then each sublist is treated as a single object and you end up with a 1D array of objects. This is why your third example works. Your first example doesn't work because you try to cast a sequence of objects to floats, which isn't possible.
In short, you can't create an array of floats if your sublists are of different lengths. At best, you can create an array of 1D arrays, since they are still considered objects.
>>> x = np.array(list(map(np.array, [[2., 3.], [3.]])), dtype=object)
>>> x
array([array([ 2., 3.]), array([ 3.])], dtype=object)
>>> x[0]
array([ 2., 3.])
>>> x[0][1]
3.0
>>> # but you can't do this
>>> x[0,1]
Traceback (most recent call last):
File "<pyshell#18>", line 1, in <module>
x[0,1]
IndexError: too many indices for array
If you're bent on creating a float 2D array, you have to extend all your sublists to the same size with None, which will be converted to np.nan.
>>> lists = [[2., 3.], [3.]]
>>> max_len = max(map(len, lists))
>>> for i, sublist in enumerate(lists):
...     sublist = sublist + [None] * (max_len - len(sublist))
...     lists[i] = sublist
...
>>> np.array(lists, dtype=np.float32)
array([[ 2., 3.],
[ 3., nan]], dtype=float32)
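An alternative to padding the lists with None is to preallocate a NaN-filled array and copy each sublist into the front of its row. This is only a sketch of the same idea; the name padded is made up here:

```python
import numpy as np

lists = [[2., 3.], [3.]]
max_len = max(len(sub) for sub in lists)

# start with every slot set to NaN, then overwrite the part each sublist covers
padded = np.full((len(lists), max_len), np.nan, dtype=np.float32)
for row, sub in enumerate(lists):
    padded[row, :len(sub)] = sub

print(padded)
```

This leaves the original lists untouched, which may matter if they are reused elsewhere.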

Compare two ndarrays with different dimensions

I have two ndarrays. First ndarray has string in one column and float values in another column. Second ndarray contains only a column of string values.
For eg:
Array1                Array2
"abc"      1.000      "abc"
"fsfds"   -5.000      "qw"
"svs"      2.094      "svs"
"dfdsge"   3.348      "dd"
My question is, how can I compare matching string values from Array1 and Array2 then return corresponding float values from Array1?
I tried set(Array1) & set(Array2) to find unique elements but don't know how to extract float values. Is there a function in numpy?
Thank you.
The easiest way to turn your example into arrays is to copy-n-paste it as a multiline string and use genfromtxt to parse it:
In [344]: txt=b'''"abc" 1.000 "abc"
...: "fsfds" -5.000 "qw"
...: "svs" 2.094 "svs"
...: "dfdsge" 3.348 "dd" '''
In [346]: np.genfromtxt(txt.splitlines(),dtype=None)
Out[346]:
array([(b'"abc"', 1. , b'"abc"'), (b'"fsfds"', -5. , b'"qw"'),
(b'"svs"', 2.094, b'"svs"'), (b'"dfdsge"', 3.348, b'"dd"')],
dtype=[('f0', 'S8'), ('f1', '<f8'), ('f2', 'S5')])
With dtype=None it deduces column dtype, and creates a structured array. I can split that into 2 arrays, one with 2 fields, the other with 1. These are all 1d.
In [347]: arr1, arr2 = _[['f0','f1']], _['f2']
In [348]: arr1
Out[348]:
array([(b'"abc"', 1. ), (b'"fsfds"', -5. ), (b'"svs"', 2.094),
(b'"dfdsge"', 3.348)],
dtype=[('f0', 'S8'), ('f1', '<f8')])
In [349]: arr2
Out[349]:
array([b'"abc"', b'"qw"', b'"svs"', b'"dd"'],
dtype='|S5')
You are a little unclear about how you want to compare the text columns. An easy one that looks reasonable with this data is just element by element, the simple ==.
In [350]: arr1['f0']==arr2
Out[350]: array([ True, False, True, False], dtype=bool)
With this boolean mask I can easily select the elements of arr1:
In [351]: arr1[_]
Out[351]:
array([(b'"abc"', 1. ), (b'"svs"', 2.094)],
dtype=[('f0', 'S8'), ('f1', '<f8')])
Let's see if I can turn these into object arrays.
In [372]: array1 = np.array(arr1.tolist(),dtype=object)
In [373]: array2 = np.array(arr2.tolist(),dtype=object)
In [374]: array1
Out[374]:
array([[b'"abc"', 1.0],
[b'"fsfds"', -5.0],
[b'"svs"', 2.094],
[b'"dfdsge"', 3.348]], dtype=object)
In [375]: array2
Out[375]: array([b'"abc"', b'"qw"', b'"svs"', b'"dd"'], dtype=object)
We can get the same mask:
In [376]: array1[:,0]==array2
Out[376]: array([ True, False, True, False], dtype=bool)
In [377]: array1[_,:]
Out[377]:
array([[b'"abc"', 1.0],
[b'"svs"', 2.094]], dtype=object)
Another way to get a mask:
In [378]: np.in1d(array2,array1[:,0])
Out[378]: array([ True, False, True, False], dtype=bool)
In this case it produces the same thing.
Actually to get the rows of array1 that are in array2 (in any order), we need to switch the order:
In [389]: np.in1d(array1[:,0],array2[[1,0,3,2]])
Out[389]: array([ True, False, True, False], dtype=bool)
Look at in1d and the related array set functions for more ideas and details.
In any case, use field or column selection to get the 1d array of strings that can be compared to the strings in the other array.
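Putting that together, a minimal sketch of the membership approach, assuming the strings and floats are held in two separate, aligned 1d arrays (names1, values1, and names2 are illustrative names, and np.isin is the newer spelling of np.in1d):

```python
import numpy as np

names1 = np.array(['abc', 'fsfds', 'svs', 'dfdsge'])
values1 = np.array([1.000, -5.000, 2.094, 3.348])
names2 = np.array(['svs', 'qw', 'abc', 'dd'])   # deliberately in a different order

# True for each entry of names1 that occurs anywhere in names2
mask = np.isin(names1, names2)

# the floats that belong to the matching names
matched = values1[mask]
print(matched)
```

Unlike the element-by-element == comparison, this does not depend on the two arrays having the same length or ordering.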
You can use array comparison as your index for the first dimension to select the rows you want. I'm not sure exactly how you have an ndarray containing both strings and floats, but here's an example where we set it so the first and last rows have the same value in the first column.
import numpy as np
array_1 = np.random.randn(4, 2)
array_2 = np.random.randn(4)
array_2[3] = array_1[3, 0]
array_2[0] = array_1[0, 0]
print(array_1, array_2)
print(array_1[array_1[:, 0] == array_2, 1])
This gives
[[ 0.76170733 -1.40708366]
[-1.42535617 -1.03982291]
[ 0.67999753 -0.92733875]
[ 0.96474552 -1.95639871]]
[ 0.76170733 0.95046454 0.1548689 0.96474552]
[-1.40708366 -1.95639871]
I think that list comprehension can do the trick here:
Output=[i[1] for i in Array1 if i[0] in Array2]

Method for generating row-column arrays with other arrays

I am attempting to utilize numpy to the best of its capabilities, but I am obviously missing some important link in the documentation due to my 'Noob-ness'.
What I want to do is create an array with a certain number of rows and columns and populate it with a sub array. The sub array is incremented by a pair of values as one traverses along the row. For subsequent rows, another pair of values is used to populate the columns. The best I have come up with is to use list comprehensions to generate the desired output. At this stage I can create an array which doesn't have the desired shape...I can deal with that in an awkward fashion, so all is not lost.
Here is what I have so far:
>>> import numpy as np
>>> np.set_printoptions(precision=4,threshold=20,edgeitems=3,linewidth=80) # default print options
>>> sub = np.array([[1,2],[3,4],[5,6]],dtype='float64') # a sub array of floats
>>> sub
array([[ 1., 2.],
[ 3., 4.],
[ 5., 6.]])
>>> e = np.empty((3,4),dtype='object') # an empty array of the desired shape
>>> e
array([[None, None, None, None],
[None, None, None, None],
[None, None, None, None]], dtype=object)
>>> dX = 1; dY = np.sqrt(3.)/2.0 # values to add to sub array per cell in e
>>> rows,cols = e.shape # rows and columns from 'e' shape
>>> out = [sub + [dX*i,dY*(i%2)] for i in range(0,cols)] # create the first row
>>> for j in range(1,rows): # create the other rows
...     out += [out[k] + [0,-dY*2*j] for k in range(cols)]
...
>>> arr = np.array(out)
>>> arr.shape # expect to see ((3,4),3,2)...I think
(12, 3, 2)
>>> arr[0:4] # I will let you try this to see the format
The last line just shows the format of the first 4 elements of the output array. What I was hoping to do was populate the empty array, e, in a fashion which is more elegant than my list comprehension method AND/OR how to reshape the array properly. Again, unless I am missing links in the documentation, I would have expected a 3x4 array containing 3x2 subarrays...which is not what it is showing me. I would appreciate any help or links to appropriate documentation, since, I have spent hours trolling this site and I am obviously missing some appropriate numpy terminology.
The first out is a list of 4 (3,2) arrays.
np.array(out) at this stage produces a (4,3,2) array. np.array creates the highest dimension array that the data allows. In this case it concatenates those 4 arrays along a new dimension.
After the rows loop, out is a list of 12 arrays. out +=... on a list appends them.
So by the same logic, arr = np.array(out) will produce a (12,3,2) array. That could be reshaped: arr = arr.reshape(3,4,3,2).
Subarrays from arr could be copied to e, e.g.:
e[0,0] = arr[0,0]
Which raises the question, why do you want an array like e? What advantage does it have over arr? arr represents 'the best of numpy's capabilities'; e tries to extend them into poorly developed areas.
Your out list can be vectorized with something along this line:
ii = np.arange(cols)
ixy = np.array([dX*ii, dY*(ii%2)])
arr1 = sub[None,:,:] + ixy.T[:,None,:]
arr1 is a (4,3,2) array, and could be copied to the e[0,:] elements.
This could be cleaned up and extended to the other rows.
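One possible way to extend this to all rows at once is sketched below. It assumes the offsets implied by the question's loops, namely that cell (j, k) adds [dX*k, dY*(k%2) - 2*dY*j] to sub; the names jj, kk, offsets, and grid are introduced here for illustration:

```python
import numpy as np

sub = np.array([[1, 2], [3, 4], [5, 6]], dtype='float64')
dX = 1
dY = np.sqrt(3.0) / 2.0
rows, cols = 3, 4

jj = np.arange(rows)[:, None]   # row index, shape (rows, 1)
kk = np.arange(cols)[None, :]   # column index, shape (1, cols)

# per-cell (x, y) offsets, shape (rows, cols, 2)
offsets = np.zeros((rows, cols, 2))
offsets[..., 0] = dX * kk
offsets[..., 1] = dY * (kk % 2) - 2 * dY * jj

# broadcast the (3, 2) sub array against every cell's offset
grid = sub[None, None, :, :] + offsets[:, :, None, :]
print(grid.shape)
```

Here grid[j, k] is the (3,2) subarray for row j and column k, so no object array is needed to address the cells.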
A clean way of iterating over all the elements of e, and assigning the corresponding subarray of arr uses np.ndindex (from the index_tricks module):
for i in np.ndindex(3,4):
    e[i]=arr[i]
While it is a Python level iteration, it does not involve copying data. It just copies pointers. I'm a little surprised about this, but e[i,j] points to the same data block as arr[i,j]. This is evident from the .__array_interface__ values, and by modifying entries, e.g.
e[1,1][0,0] = 30
changes the value of arr[1,1,0,0].
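The sharing can also be checked directly with np.shares_memory. A small sketch, using a stand-in arr filled with np.arange rather than the question's values:

```python
import numpy as np

# stand-in for the (3,4,3,2) array discussed above
arr = np.arange(3 * 4 * 3 * 2, dtype=float).reshape(3, 4, 3, 2)
e = np.empty((3, 4), dtype=object)

for idx in np.ndindex(3, 4):
    e[idx] = arr[idx]        # stores a view of arr, not a copy

shares = np.shares_memory(e[1, 1], arr)
print(shares)

# writing through the object array is visible in arr
e[1, 1][0, 0] = 30.0
print(arr[1, 1, 0, 0])
```

If independent subarrays are wanted instead, storing arr[idx].copy() would break the link.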

Shapes of the np.arrays, unexpected additional dimension

I'm dealing with arrays in python, and this generated a lot of doubts...
1) I produce a list of list reading 4 columns from N files and I store 4 elements for N times in a list. I then convert this list in a numpy array:
s = np.array(s)
and I ask for the shape of this array. The answer is correct:
print s.shape
#(N,4)
I then produce the mean of this Nx4 array:
s_m = sum(s)/len(s)
print s_m.shape
#(4,)
which I guess means that this array is a 1D array. Is this correct?
2) If I subtract the mean vector s_m from the rows of the array s, I can proceed in two ways:
residuals_s = s - s_m
or:
residuals_s = []
for i in range(len(s)):
    residuals_s.append([])
    tmp = s[i] - s_m
    residuals_s.append(tmp)
if I now ask for the shape of residuals_s in the two cases I obtain two different answers. In the first case I obtain:
(N,4)
in the second:
(N,1,4)
can someone explain why there is an additional dimension?
You can get the mean using the numpy method (producing the same (4,) shape):
s_m = s.mean(axis=0)
s - s_m works because s_m is 'broadcasted' to the dimensions of s.
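A short sketch of that broadcasting step, using random stand-in data of shape (10, 4) since the original files are not available:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal((10, 4))   # stand-in for the N x 4 data

s_m = s.mean(axis=0)               # shape (4,), one mean per column
residuals = s - s_m                # (10, 4) - (4,) broadcasts over the rows

print(s_m.shape, residuals.shape)
```

Each row of s has the same (4,) vector subtracted from it, so the column means of the result are zero up to rounding.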
If I run your second residuals_s I get a list containing empty lists and arrays:
[[],
array([ 1.02649662, 0.43613824, 0.66276758, 2.0082684 ]),
[],
array([ 1.13000227, -0.94129685, 0.63411801, -0.383982 ]),
...
]
That does not convert to a (N,1,4) array, but rather a (M,) array with dtype=object. Did you copy and paste correctly?
A corrected iteration is:
for i in range(len(s)):
    residuals_s.append(s[i]-s_m)
produces a simpler list of arrays:
[array([ 1.02649662, 0.43613824, 0.66276758, 2.0082684 ]),
array([ 1.13000227, -0.94129685, 0.63411801, -0.383982 ]),
...]
which converts to a (N,4) array.
Iteration like this usually is not needed. But if it is, appending to lists like this is one way to go. Another is to preallocate an array and assign rows:
residuals_s = np.zeros_like(s)
for i in range(s.shape[0]):
    residuals_s[i,:] = s[i]-s_m
I get your (N,1,4) with:
In [39]: residuals_s=[]
In [40]: for i in range(len(s)):
   ....:     residuals_s.append([])
   ....:     tmp = s[i] - s_m
   ....:     residuals_s[-1].append(tmp)
In [41]: residuals_s
Out[41]:
[[array([ 1.02649662, 0.43613824, 0.66276758, 2.0082684 ])],
[array([ 1.13000227, -0.94129685, 0.63411801, -0.383982 ])],
...]
In [43]: np.array(residuals_s).shape
Out[43]: (10, 1, 4)
Here the s[i]-s_m array is appended to an empty list, which has been appended to the main list. So it's an array within a list within a list. It's this intermediate list that produces the middle 1 dimension.
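The effect of that extra level of nesting can be reproduced in a few lines (a sketch with a small stand-in s of shape (3, 4); flat and nested are illustrative names):

```python
import numpy as np

s = np.arange(12.0).reshape(3, 4)   # stand-in data
s_m = s.mean(axis=0)

# each row's residual appended directly: a list of (4,) arrays
flat = [s[i] - s_m for i in range(len(s))]

# each residual wrapped in its own one-element list first
nested = [[s[i] - s_m] for i in range(len(s))]

print(np.array(flat).shape)     # two dimensions
print(np.array(nested).shape)   # the inner lists add a length-1 axis
```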
You are using a NumPy ndarray without using the functions in NumPy. sum() is a Python built-in function; you should use numpy.sum() instead.
I suggest you change your code to:
import numpy as np
np.random.seed(0)
s = np.random.randn(10, 4)
s_m = np.mean(s, axis=0, keepdims=True)
residuals_s = s - s_m
print(s.shape, s_m.shape, residuals_s.shape)
Using the mean() function with the axis and keepdims arguments will give you the correct result.
