I have a list of ints, a, between 0 and 3000. len(a) = 3000. I have a for loop that iterates through this list, searching for the indices of each element in a larger array.
import numpy as np
a = [i for i in range(3000)]
array = np.random.randint(0, 3000, size=(12, 1000, 1000))
newlist = []
for i in range(len(a)):
    coord = np.where(array == a[i])
    newlist.append(coord)
As you can see, coord will be 3 arrays of the coordinates x, y, z for the values in the 3D matrix that equal the value in the list.
Is there a way to do this in a vectorized manner without the for loop?
The output should be a list of tuples, one for each element in a:
# each coord looks like this:
print(coord)
(array([1, ..., 1000]), array([2, ..., 1000]), array([2, ..., 12]))
# combined over all the iterations:
print(newlist)
[coord1, coord2, ..., coord3000]
There is actually a fully vectorized solution to this, despite the fact that the resulting arrays are all of different sizes. The idea is this:
1. Sort all the elements of the array along with their coordinates. argsort is ideal for this sort of thing.
2. Find the cut points in the sorted data, so you know where to split the array, e.g. with diff and flatnonzero.
3. Split the coordinate array at the indices you found. If you have missing elements, you may need to generate a key based on the first element of each run.
Here is an example to walk you through it. Let's say you have a d-dimensional array of size n. Your coordinates will be a (d, n) array:
d = arr.ndim
n = arr.size
You can generate the coordinate arrays with np.indices directly:
coords = np.indices(arr.shape)
Now ravel/reshape the data and the coordinates into an (n,) and (d, n) array, respectively:
arr = arr.ravel() # Ravel guarantees C-order no matter the source of the data
coords = coords.reshape(d, n) # C-order by default as a result of `indices` too
Now sort the data:
order = np.argsort(arr)
arr = arr[order]
coords = coords[:, order]
Find the locations where the data changes value. You want the indices of the new values, so prepend a fake first element that is 1 less than the actual first element; that way the very first run registers as a change too.
change = np.diff(arr, prepend=arr[0] - 1)
The indices of those change points give the break-points in the array:
locs = np.flatnonzero(change)
You can now split the data at those locations:
result = np.split(coords, locs[1:], axis=1)
And you can create the key of values actually found:
key = arr[locs]
If you are very confident that all the values are present in the array, then you don't need the key. In that case, you can compute the split points directly as np.flatnonzero(np.diff(arr)) + 1 and pass them straight to np.split(coords, ..., axis=1).
Each element in result is already consistent with the indexing used by where/nonzero, but as a numpy array. If you specifically want a tuple, you can map it to a tuple:
result = [tuple(inds) for inds in result]
TL;DR
Combining all this into a function:
def find_locations(arr):
    coords = np.indices(arr.shape).reshape(arr.ndim, arr.size)
    arr = arr.ravel()
    order = np.argsort(arr)
    arr = arr[order]
    coords = coords[:, order]
    locs = np.flatnonzero(np.diff(arr, prepend=arr[0] - 1))
    return arr[locs], np.split(coords, locs[1:], axis=1)
You can return a list of index arrays with empty arrays for missing elements by replacing the last line with
coords = np.split(coords, locs[1:], axis=1)
result = [np.empty(0, dtype=int)] * 3000  # empty array, so OK to use the same reference
for i, j in enumerate(arr[locs]):
    result[j] = coords[i]
return result
You can optionally filter for values that are in the specific range you want (e.g. 0-2999).
You can use logical OR in numpy to apply all those equality conditions at once instead of one by one.
import numpy as np
conditions = False
for i in a:
    conditions = np.logical_or(conditions, array3d == i)
newlist = np.where(conditions)
This lets numpy evaluate np.where once over a combined mask instead of making a separate pass for each condition.
Another, more compact way to do it:
np.where(np.isin(array3d, a))
Related
I have 2 arrays with the same shape. Wherever an element of the bList array is 255, I want to take the element at the corresponding position in the aList array, and then average all of the elements selected this way.
I can do it with a loop, but that feels clumsy.
import numpy as np
aList = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
bList = np.array([[255,255,0,255], [0,0,255,255], [255,0,0,0]])
sum_list = []
for id,row in enumerate(aList):
    for index,ele in enumerate(row):
        if bList[id][index]==255:
            tmp = aList[id][index]
            sum_list.append(tmp)
average = np.mean(sum_list)  # (1+2+4+7+8+9)/6 = 5.166666666666667
Is there a simpler way?
Use numpy.where
np.mean(aList[np.where(bList==255)])
Or with a boolean mask:
mask = (bList==255)
(aList*mask).sum()/mask.sum()
Output: 5.166666666666667
Suppose I have multiple arrays inside one array, together containing the numbers from 0 to n in mixed order.
For example,
x = [[0,2,3,5],[1,4]]
Here we have two arrays in x. There could be more than two.
I want to rearrange all the array elements according to their numeric sequence, but have each position report the ID of the array its number came from. The result should be like this:
y = [0,1,0,0,1,0]
That means 0, 2, 3, 5 are in the array with id 0, so those positions show that id; the same goes for 1 and 4. Can anyone help me solve this? [N.B. There could be more than two arrays, so it would be highly appreciated if the code works for any number of arrays.]
You can do this by using a dictionary:
x = [[0,2,3,5],[1,4]]
lst = {}
for i in range(len(x)):
    for j in range(len(x[i])):
        lst[x[i][j]] = i
print(lst)  # {0: 0, 2: 0, 3: 0, 5: 0, 1: 1, 4: 1}
You can also do this using a list. list.insert(idx, value) inserts value into the list at index idx. Here we traverse all the values of x, and the value x[i][j] belongs to the i-th subarray. (Be aware that list.insert clamps indices past the end, so the final ordering depends on the order values are encountered; it works for this input but is not robust for arbitrary arrangements.)
x = [[0,2,3,5],[1,4]]
lst = []
for i in range(len(x)):
    for j in range(len(x[i])):
        lst.insert(x[i][j], i)
print(lst)
Output: [0, 1, 0, 0, 1, 0]
You might also consider using np.argsort for rearranging your array values and create the index-array with list comprehension:
import numpy as np

x = [[0,2,3,5],[1,4]]
order = np.concatenate(x).argsort()
np.concatenate([[i]*len(e) for i,e in enumerate(x)])[order]
array([0, 1, 0, 0, 1, 0])
I have two 1D arrays, x & y, one smaller than the other. I'm trying to find the index of every element of y in x.
I've found two naive ways to do this, the first is slow, and the second memory-intensive.
The slow way
indices = []
for iy in y:
    indices.append(np.where(x == iy)[0][0])
The memory hog
xe = np.outer([1,]*len(x), y)
ye = np.outer(x, [1,]*len(y))
junk, indices = np.where(np.equal(xe, ye))
Is there a faster way or less memory intensive approach? Ideally the search would take advantage of the fact that we are searching for not one thing in a list, but many things, and thus is slightly more amenable to parallelization.
Bonus points if you don't assume that every element of y is actually in x.
I want to suggest a one-line solution:
indices = np.where(np.in1d(x, y))[0]
The result is an array of indices into x, corresponding to the elements of y that were found in x.
One can use it without numpy.where if needed.
As Joe Kington said, searchsorted() can find elements very quickly. To deal with elements that are not in x, you can compare the looked-up values against the original y and create a masked array:
import numpy as np
x = np.array([3,5,7,1,9,8,6,6])
y = np.array([2,1,5,10,100,6])
index = np.argsort(x)
sorted_x = x[index]
sorted_index = np.searchsorted(sorted_x, y)
yindex = np.take(index, sorted_index, mode="clip")
mask = x[yindex] != y
result = np.ma.array(yindex, mask=mask)
print(result)
the result is:
[-- 3 1 -- -- 6]
How about this?
It does assume that every element of y is in x (and will return results even for elements that aren't!), but it is much faster.
import numpy as np
# Generate some example data...
x = np.arange(1000)
np.random.shuffle(x)
y = np.arange(100)
# Actually perform the operation...
xsorted = np.argsort(x)
ypos = np.searchsorted(x[xsorted], y)
indices = xsorted[ypos]
I think this is a clearer version:
np.where(y.reshape(y.size, 1) == x)[1]
than indices = np.where(y[:, None] == x[None, :])[1]: you don't need to broadcast x into 2D.
I found this type of solution to be best because, unlike the searchsorted()- or in1d()-based solutions posted here or elsewhere, it works with duplicates and doesn't care whether anything is sorted. That was important to me because I wanted x to be in a particular custom order.
I would just do this:
indices = np.where(y[:, None] == x[None, :])[1]
Unlike your memory-hog way, this makes use of broadcasting to generate the 2D boolean array directly, without materializing 2D copies of both x and y.
The numpy_indexed package (disclaimer: I am its author) contains a function that does exactly this:
import numpy_indexed as npi
indices = npi.indices(x, y, missing='mask')
It will currently raise a KeyError if not all elements in y are present in x; but perhaps I should add a kwarg so that one can elect to mark such items with a -1 or something.
It should have the same efficiency as the currently accepted answer, since the implementation is along similar lines. numpy_indexed is however more flexible, and also allows to search for indices of rows of multidimensional arrays, for instance.
EDIT: I've changed the handling of missing values; the 'missing' kwarg can now be set to 'raise', 'ignore' or 'mask'. In the latter case, you get a masked array of the same length as y, on which you can call .compressed() to get the valid indices. Note that there is also npi.contains(x, y) if this is all you need to know.
Another solution would be:
import numpy as np

a = np.array(['Bob', 'Alice', 'John', 'Jack', 'Brian', 'Dylan'])
z = ['Bob', 'Brian', 'John']
for i in z:
    print(np.argwhere(i == a))
My solution can additionally handle a multidimensional x. By default, it will return a standard numpy array of corresponding y indices in the shape of x.
If you can't assume that every element of x is also in y, then set masked=True to return a masked array (this has a performance penalty). Otherwise, you will still get indices for elements not contained in y, but they probably won't be useful to you.
The answers by HYRY and Joe Kington were helpful in making this.
# For each element of ndarray x, return index of corresponding element in 1d array y
# If y contains duplicates, the index of the last duplicate is returned
# Optionally, mask indices where the x element does not exist in y
def matched_indices(x, y, masked=False):
    # Flattened x
    x_flat = x.ravel()
    # Indices to sort y
    y_argsort = y.argsort()
    # Indices in sorted y of corresponding x elements, flat
    x_in_y_sort_flat = y.searchsorted(x_flat, sorter=y_argsort)
    # Indices in y of corresponding x elements, flat
    x_in_y_flat = y_argsort[x_in_y_sort_flat]
    if not masked:
        # Reshape to shape of x
        return x_in_y_flat.reshape(x.shape)
    else:
        # Check for inequality at each y index to mask invalid indices
        mask = x_flat != y[x_in_y_flat]
        # Reshape to shape of x
        return np.ma.array(x_in_y_flat.reshape(x.shape), mask=mask.reshape(x.shape))
A more direct solution, that doesn't expect the array to be sorted.
import numpy as np
import pandas as pd
A = pd.Series(['amsterdam', 'delhi', 'chromepet', 'tokyo', 'others'])
B = pd.Series(['chromepet', 'tokyo', 'tokyo', 'delhi', 'others'])
# Find index position of B's items in A
B.map(lambda x: np.where(A==x)[0][0]).tolist()
Result is:
[2, 3, 3, 1, 4]
A more compact solution:
indices, = np.in1d(x, y).nonzero()
I have a function that returns many output arrays of varying size.
arr1,arr2,arr3,arr4,arr5, ... = func(data)
I want to run this function many times over a time series of data, and combine each output variable into one array that covers the whole time series.
To elaborate: If the output arr1 has dimensions (x,y) when the function is called, I want to run the function 't' times and end up with an array that has dimensions (x,y,t). A list of 't' arrays with size (x,y) would also be acceptable, but not preferred.
Again, the output arrays do not all have the same dimensions, or even the same number of dimensions. arr2 might have size (x2, y2), and arr3 might be only a vector of length x3. I do not know the sizes of these arrays beforehand.
My current solution is something like this:
arr1 = []
arr2 = []
arr3 = []
...
for t in range(t_max):
    arr1_t, arr2_t, arr3_t, ... = func(data[t])
    arr1.append(arr1_t)
    arr2.append(arr2_t)
    arr3.append(arr3_t)
    ...
and so on. However, this looks inelegant when repeated 27 times, once for each output array.
Is there a better way to do this?
You can just make arr1, arr2, etc. a list of lists (of vectors or matrices or whatever). Then use a loop to iterate the results obtained from func and add them to the individual lists.
arrN = [[] for _ in range(N)] # N being number of results from func
for t in range(t_max):
    results = func(data[t])
    for i, res in enumerate(results):
        arrN[i].append(res)
The elements in the different sub-lists do not have to have the same dimensions.
Not sure if it counts as "elegant", but you can build a list of the result tuples, then use zip to group them into tuples by return position instead of by call number, then optionally map to convert those tuples to the final data type. For example, with numpy arrays:
from future_builtins import map, zip # Only on Python 2, to minimize temporaries
import numpy as np
def func(x):
    'Dumb function to return tuple of powers of x from 1 to 27'
    return tuple(x ** i for i in range(1, 28))

# Example inputs for func
data = [np.array([[x]*10]*10, dtype=np.uint8) for x in range(10)]
# Output is generator of results for each call to func
outputs = map(func, data)
# Pass each complete result of func as a positional argument to zip via star
# unpacking to regroup, so the first return from each func call is the first
# group, then the second return the second group, etc.
positional_groups = zip(*outputs)
# Convert regrouped data (`tuple`s of 2D results) to numpy 3D result type, unpack to names
arr1,arr2,arr3,arr4,arr5, ...,arr27 = map(np.array, positional_groups)
If the elements returned from func at a given position might have inconsistent dimensions (e.g. one call might return 10x10 as the first return, and another 5x5), you'd avoid the final map step (since the arrays wouldn't have consistent dimensions) and just replace the second-to-last step with:
arr1,arr2,arr3,arr4,arr5, ...,arr27 = zip(*outputs)
making arr# a tuple of 2D arrays, or, if they need to be mutable:
arr1,arr2,arr3,arr4,arr5, ...,arr27 = map(list, zip(*outputs))
to make them lists of 2D arrays.
This answer gives a solution using structured arrays. It has the following requirement: given a function f that returns N arrays, where the sizes of the returned arrays may differ from one another, the shape of the i-th returned array must be the same across all results of f. e.g.
arrs_a = f("a")
arrs_b = f("b")
for sub_arr_a, sub_arr_b in zip(arrs_a, arrs_b):
    assert len(sub_arr_a) == len(sub_arr_b)
If the above is true, then you can use structured arrays. A structured array is like a normal array, just with a complex data type. For instance, I could specify a data type that is made up of one array of ints of shape 5, and a second array of floats of shape (2, 2). eg.
# define what a record looks like
dtype = [
# tuples of (field_name, data_type)
("a", "5i4"), # array of five 4-byte ints
("b", "(2,2)f8"), # 2x2 array of 8-byte floats
]
Using dtype you can create a structured array, and set all the results on the structured array in one go.
import numpy as np
def func(n):
    "mock implementation of func"
    return (
        np.ones(5) * n,
        np.ones((2,2)) * n,
    )
# define what a record looks like
dtype = [
# tuples of (field_name, data_type)
("a", "5i4"), # array of five 4-byte ints
("b", "(2,2)f8"), # 2x2 array of 8-byte floats
]
size = 5
# create array
arr = np.empty(size, dtype=dtype)
# fill in values
for i in range(size):
    # func must return a tuple
    # or you must convert the returned value to a tuple
    arr[i] = func(i)
# alternate way of instantiating arr
arr = np.fromiter((func(i) for i in range(size)), dtype=dtype, count=size)
# How to use structured arrays
# access individual record
print(arr[1]) # prints ([1, 1, 1, 1, 1], [[1., 1.], [1., 1.]])
# access specific value -- get second record -> get b field -> get value at 0,0
assert arr[2]['b'][0,0] == 2
# access all values of a specific field
print(arr['a']) # prints all the a arrays
I have yet to get my head around numpy array referencing.
I have arrays where the first 2 columns will always have some negative values that are necessary, and the remaining columns need their negative values substituted with 0s. I understand that there are various ways to do this. The part that's baffling me is how to combine one of these methods with applying it only to the columns past the first two.
Example array:
[[x, y, flow, element1, element2, element3],
 [x, y, flow, element1, element2, element3],
 [x, y, flow, element1, element2, element3]]
Desired result would be that for the whole array, any of the values that are negative are replaced with 0 unless they are x or y.
It sounds like you want:
subset = array[:, 2:]
subset[subset < 0] = 0
or as a rather unreadable one-liner:
array[:, 2:][array[:, 2:] < 0] = 0
As a more complete example:
import numpy as np
array = np.random.normal(0, 1, (10, 5))
print(array)
# Note that "subset" is a view, so modifying it modifies "array"
subset = array[:, 2:]
subset[subset < 0] = 0
print(array)
You would need to clip subsets of the arrays. Something like this (note that clip returns a new array rather than modifying in place):
a[2:].clip(0, None)
You could do this a couple of ways. One would be in a for loop:
for row in list_of_lists:
    row[2:] = row[2:].clip(0, None)
Or, using [:, 2:], which selects every row (:) and, within each row, the columns from index 2 onward (2:). The result is basically what Joe Kington suggested:
arr[:, 2:] = arr[:, 2:].clip(0, None)