I have a numpy array, whose elements are unique, for example:
b = np.array([5, 4, 6, 8, 1, 2])
(Edit 2: b can contain large numbers and floats; the example above is just for simplicity.)
I am given numbers that are elements of b.
I want to find their indices in b, i.e. I want a reverse mapping from value to index in b.
I could do
for number in input:
    ind = np.where(number == b)
which iterates over the entire array on every call to where.
I could also create a dictionary,
d = {}
for i, element in enumerate(list(b)):
    d[element] = i
I could create this dictionary at "preprocessing" time, but I would still be left with a strange-looking dictionary in otherwise mostly-numpy code, which seems (to me) not how numpy is meant to be used.
How can I do this reverse mapping in numpy?
usage (O(1) time and memory required):
print("index of 8 is: ", foo(b, 8))
Edit1: not a duplicate of this
Using in1d as explained here doesn't solve my problem. Using their example:
b = np.array([1, 2, 3, 10, 4])
I want to be able to find for example 10's index in b, at runtime, in O(1).
Doing a pre-processing step
mapping = np.in1d(b, b).nonzero()[0]
>> [0, 1, 2, 3, 4]
(which could be accomplished using np.arange(len(b)))
doesn't really help, because when 10 comes in as input, it is not possible to tell its index in O(1) time with this method.
It's simpler than you think, by exploiting numpy's advanced indexing.
What we do is make a target array and assign into it using b as an index; the values we assign are the positions themselves, produced with arange.
>>> t = np.zeros((np.max(b) + 1,))
>>> t[b] = np.arange(0, b.size)
>>> t
array([0., 4., 5., 0., 1., 0., 2., 0., 3.])
You might use nans or -1 instead of zeros to construct the target to help detect invalid lookups.
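For example, a minimal sketch of the -1 variant (using np.full to build the pre-filled target), so misses are easy to spot:
import numpy as np

b = np.array([5, 4, 6, 8, 1, 2])
t = np.full(np.max(b) + 1, -1)   # -1 marks "value not present in b"
t[b] = np.arange(b.size)
t[8]   # 3
t[7]   # -1, i.e. 7 is not in b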
Memory usage: lookups are handled entirely by numpy and are effectively O(1), but the target array needs max(b) + 1 entries, so this is only space-efficient when the values are reasonably small non-negative integers.
If your values don't fit in a small integer range (floats, say), you can implement a poor man's hash table, provided you can tolerate (or rule out) collisions. Suppose we have currency values, for example:
h = np.int32(b * 100.0) % 101 # Typically some prime number
t = np.zeros((101,))
t[h] = np.arange(0, h.size)
# Retrieving a value v; keep in mind v can be an ndarray itself.
t[np.int32(v * 100.0) % 101]
You can do any other steps to munge the address if you know what your dataset looks like.
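As a concrete (hypothetical) usage of the sketch above, with made-up currency values that happen not to collide under this hash:
import numpy as np

b = np.array([5.25, 4.50, 6.10, 8.00, 1.75, 2.30])
h = np.int32(b * 100.0) % 101
t = np.zeros(101)
t[h] = np.arange(h.size)

v = np.array([8.00, 1.75])
t[np.int32(v * 100.0) % 101]   # array([3., 4.]) -- indices of 8.00 and 1.75 in b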
This is about the limit of what's useful to do with numpy.
Solution
If you want constant time (i.e. O(1)), then you'll need to precompute a lookup table of some sort. If you want to make your lookup table using another Numpy array, it will effectively have to be a sparse array in which most values are "empty". Here's a workable approach in which empty values are marked as -1:
b = np.array([5, 4, 6, 8, 1, 2])
_b_ix = np.array([-1]*(b.max() + 1))
_b_ix[b] = np.arange(b.size)
# _b_ix: array([-1, 4, 5, -1, 1, 0, 2, -1, 3])
def foo(*val):
    return _b_ix[list(val)]
Test:
print("index of 8 is: %s" % foo(8))
print("index of 0,5,1,8 is: %s" % foo(0,5,1,8))
Output:
index of 8 is: [3]
index of 0,5,1,8 is: [-1 0 4 3]
Caveat
In production code, you should definitely use a dictionary to solve this problem, as other answerers have pointed out. Why? Well, for one thing, say that your array b contains float values, or any non-int value. Then a Numpy-based lookup table won't work at all.
Thus, you should use the above answer only if you have a deep-seated philosophical opposition to using a dictionary (eg a dict ran over your pet cat).
Here's a nice way to generate a reverse lookup dict:
ix = {k:v for v,k in enumerate(b.flat)}
You can use dict, zip and numpy.arange to create your reverse lookup:
import numpy
b = np.array([5, 4, 6, 8, 1, 2])
d = dict(zip(b, np.arange(0,len(b))))
print(d)
gives:
{5: 0, 4: 1, 6: 2, 8: 3, 1: 4, 2: 5}
If you want to do multiple lookups, you can do these in O(1) after an initial O(n) traversal to create a lookup dictionary.
b = np.array([5, 4, 6, 8, 1, 2])
lookup_dict = {e:i for i,e in enumerate(b)}
def foo(element):
    return lookup_dict[element]
And this works for your test:
>>> print('index of 8 is:', foo(8))
index of 8 is: 3
Note that if there is a possibility that b may have changed since the last foo() call, we must re-create the dictionary.
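A minimal sketch of that (rebuild_lookup is just a hypothetical helper name):
def rebuild_lookup(b):
    return {e: i for i, e in enumerate(b)}

lookup_dict = rebuild_lookup(b)
# ... b gets modified somewhere ...
lookup_dict = rebuild_lookup(b)   # refresh so foo() sees the new indices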
Related
I'm trying to do some calculations (mean, sum, etc.) on a list containing numpy arrays.
For example:
list = [array([2, 3, 4]),array([4, 4, 4]),array([6, 5, 4])]
How can I retrieve the mean (for example)?
As a list like [4,4,4], or a numpy array like array([4,4,4])?
Thanks in advance for your help!
EDIT: Sorry, I didn't explain properly what I was aiming to do: I would like the mean over the i-th index of the arrays. For example, for index 0:
(2+4+6)/3 = 4
I don't want this:
(2+3+4)/3 = 3
Therefore the end result should be
[4,4,4] and not [3,4,5]
If L were a list of scalars, then calculating the mean could be done using the straightforward expression:
sum(L) / len(L)
Luckily, this works unchanged on lists of arrays:
L = [np.array([2, 3, 4]), np.array([4, 4, 4]), np.array([6, 5, 4])]
sum(L) / len(L)
# array([4., 4., 4.])
For this example this happens to be quite a bit faster than the numpy function np.mean:
from timeit import timeit

timeit(lambda: np.mean(L, axis=0))
# 13.708808058872819
timeit(lambda: sum(L) / len(L))
# 3.4780975924804807
You can use a for loop and iterate through the elements of your array, if your list is not too big:
mean = []
for i in range(len(list)):
    mean.append(np.mean(list[i]))
Given a 1d array a, np.mean(a) should do the trick.
If you have a 2d array and want the mean of each row separately, specify np.mean(a, axis=1).
There are equivalent functions for np.sum, etc.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html
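To connect this with the question's list of equal-length arrays, a sketch that stacks them into a 2d array first, so axis picks which mean you get:
import numpy as np

L = [np.array([2, 3, 4]), np.array([4, 4, 4]), np.array([6, 5, 4])]
M = np.vstack(L)
M.mean(axis=0)   # array([4., 4., 4.])  -- per-index mean, what the question asks for
M.mean(axis=1)   # array([3., 4., 5.])  -- per-array mean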
You can use np.mean with axis=0:
import numpy as np
my_list = [np.array([2, 3, 4]), np.array([4, 4, 4]), np.array([6, 5, 4])]
np.mean(my_list, axis=0)   # array([4., 4., 4.])
Note: do not name your variable list, as it will shadow the built-in.
Let's say I have an array like:
arr = np.array([[1,20,5],
[1,20,8],
[3,10,4],
[2,30,6],
[3,10,5]])
and I would like to form a dictionary of the sum of the third column for each row that matches each value in the first column, i.e. return {1: 13, 2: 6, 3: 9}. To make matters more challenging, there are 1 billion rows in my array and 100k unique elements in the first column.
Approach 1: Naively, I can invoke np.unique(), then iterate through each unique item with a combination of np.where() and np.sum() inside a one-liner dict comprehension (see the sketch just below). This would be reasonably fast with a small number of unique elements, but at 100k unique elements I would incur a lot of wasted page fetches, making 100k passes over the entire array.
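For concreteness, a minimal sketch of Approach 1 (this is the one-pass-per-unique-key pattern I want to avoid):
import numpy as np

arr = np.array([[1, 20, 5], [1, 20, 8], [3, 10, 4], [2, 30, 6], [3, 10, 5]])
sums = {k: arr[arr[:, 0] == k, 2].sum() for k in np.unique(arr[:, 0])}
# -> {1: 13, 2: 6, 3: 9} (keys and values are numpy integers)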
Approach 2: I could instead make a single pass over the array (hashing column 1 at each row will probably be cheaper than the excessive page fetches), but then I lose the advantage of numpy's vectorized C inner loop.
Is there a fast way to implement Approach 2 without resorting to a pure Python loop?
numpy approach:
u = np.unique(arr[:, 0])
s = ((arr[:, [0]] == u) * arr[:, [2]]).sum(0)
dict(np.stack([u, s]).T)
{1: 13, 2: 6, 3: 9}
pandas approach:
import pandas as pd
import numpy as np
pd.DataFrame(arr, columns=list('ABC')).groupby('A').C.sum().to_dict()
{1: 13, 2: 6, 3: 9}
Here's a NumPy based approach using np.add.reduceat -
sidx = arr[:,0].argsort()
idx = np.append(0,np.where(np.diff(arr[sidx,0])!=0)[0]+1)
keys = arr[sidx[idx],0]
vals = np.add.reduceat(arr[sidx,2],idx,axis=0)
If you would like to get the keys and values in a 2-column array -
out = np.column_stack((keys,vals))
Sample run -
In [351]: arr
Out[351]:
array([[ 1, 20, 5],
[ 1, 20, 8],
[ 3, 10, 4],
[ 2, 30, 6],
[ 3, 10, 5]])
In [352]: out
Out[352]:
array([[ 1, 13],
[ 2, 6],
[ 3, 9]])
This is a typical grouping problem, which the numpy_indexed package solves efficiently and elegantly (if I may say so myself; I am its author)
import numpy_indexed as npi
npi.group_by(arr[:, 0]).sum(arr[:, 2])
It's a much more lightweight solution than the pandas package, and I think the syntax is cleaner, as there is no need to create a special data structure just to perform this kind of elementary operation. Performance should be similar to the solution posted by Divakar, since it follows the same steps, just with a nice and tested interface on top.
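If you want the plain dict from the question, and assuming (as I recall from the numpy_indexed docs) that group_by(...).sum(...) returns the unique keys together with the per-group sums, a sketch:
import numpy_indexed as npi

keys, sums = npi.group_by(arr[:, 0]).sum(arr[:, 2])
dict(zip(keys, sums))   # {1: 13, 2: 6, 3: 9}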
The proper way of doing this with NumPy is np.bincount. If your unique first-column labels are already small contiguous integers, you could simply do:
cum_sums = np.bincount(arr[:, 0], weights=arr[:, 2])
cum_dict = {index: cum_sum for index, cum_sum in enumerate(cum_sums)
            if cum_sum != 0}
where the cum_sum != 0 check is an attempt to skip missing first-column labels; this can be grossly wrong if your third column includes negative numbers, since a label that is actually present could then sum to exactly zero and be dropped.
Alternatively you can do things properly and call np.unique first and do:
uniques, indices = np.unique(arr[:, 0], return_inverse=True)
cum_sums = np.bincount(indices, weights=arr[:, 2])
cum_dict = {index: cum_sum for index, cum_sum in zip(uniques, cum_sums)}
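Checking this against the sample arr from the question (note that bincount with weights always produces floats):
arr = np.array([[1, 20, 5], [1, 20, 8], [3, 10, 4], [2, 30, 6], [3, 10, 5]])
uniques, indices = np.unique(arr[:, 0], return_inverse=True)
cum_sums = np.bincount(indices, weights=arr[:, 2])
dict(zip(uniques, cum_sums))   # {1: 13.0, 2: 6.0, 3: 9.0}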
What is the best way to implement a function which takes an arbitrary number of 1d arrays and returns a tuple containing the indices of the matching values (if any)?
Here is some pseudo-code of what I want to do:
a = np.array([1, 0, 4, 3, 2])
b = np.array([1, 2, 3, 4, 5])
c = np.array([4, 2])
(ind_a, ind_b, ind_c) = return_equals(a, b, c)
# ind_a = [2, 4]
# ind_b = [1, 3]
# ind_c = [0, 1]
(ind_a, ind_b, ind_c) = return_equals(a, b, c, sorted_by=a)
# ind_a = [2, 4]
# ind_b = [3, 1]
# ind_c = [0, 1]
def return_equals(*args, sorted_by=None):
    ...
You can use numpy.intersect1d with reduce for this:
from functools import reduce  # reduce lives in functools on Python 3

def return_equals(*arrays):
    matched = reduce(np.intersect1d, arrays)
    return np.array([np.where(np.in1d(array, matched))[0] for array in arrays])
reduce may be a little slow here because it creates intermediate NumPy arrays (for a large number of inputs it may be very slow); we can avoid this by using Python's set and its .intersection() method:
matched = np.array(list(set(arrays[0]).intersection(*arrays[1:])))
Related GitHub ticket: n-array versions of set operations, especially intersect1d
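Put together, a runnable sketch of the set-based variant (it returns a list rather than a 2d array, since in general the per-array index lists can have different lengths):
import numpy as np

def return_equals(*arrays):
    matched = np.array(sorted(set(arrays[0]).intersection(*arrays[1:])))
    return [np.where(np.in1d(a, matched))[0] for a in arrays]

a = np.array([1, 0, 4, 3, 2])
b = np.array([1, 2, 3, 4, 5])
c = np.array([4, 2])
return_equals(a, b, c)
# [array([2, 4]), array([1, 3]), array([0, 1])]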
This solution basically concatenates all input 1D arrays into one big 1D array, with the intention of performing the required operations in a vectorized manner. The only place it uses a loop is at the start, where it gets the lengths of the input arrays, which should add minimal runtime cost.
Here's the function implementation -
import numpy as np
def return_equals(*argv):
    # Concatenate input arrays into one big array for vectorized processing
    A = np.concatenate((argv[:]))

    # Lengths of input arrays
    narr = len(argv)
    lens = np.zeros((1,narr),int).ravel()
    for i in range(narr):
        lens[i] = len(argv[i])

    N = A.size

    # Start indices of each group of identical elements from different input arrays
    # in a sorted version of the huge concatenated input array
    start_idx = np.where(np.append([True],np.diff(np.sort(A))!=0))[0]

    # Run lengths of islands of identical elements
    runlens = np.diff(np.append(start_idx,N))

    # Starting and all indices of the positions in the concatenated array that hold
    # islands of identical elements present across all input arrays
    good_start_idx = start_idx[runlens==narr]
    good_all_idx = good_start_idx[:,None] + np.arange(narr)

    # Get offsetted indices and sort them to get the desired output
    idx = np.argsort(A)[good_all_idx] - np.append([0],lens[:-1].cumsum())
    return np.sort(idx.T,1)
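A quick check against the example arrays from the question (note this approach assumes each common value occurs exactly once in every input array):
a = np.array([1, 0, 4, 3, 2])
b = np.array([1, 2, 3, 4, 5])
c = np.array([4, 2])
return_equals(a, b, c)
# array([[2, 4],
#        [1, 3],
#        [0, 1]])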
In Python:
def return_equal(*args):
    rtr = []
    for i, arr in enumerate(args):
        rtr.append([j for j, e in enumerate(arr) if
                    all(e in a for a in args[0:i]) and
                    all(e in a for a in args[i+1:])])
    return rtr
>>> return_equal(a,b,c)
[[2, 4], [1, 3], [0, 1]]
For a start, I'd try:
def return_equals(*args):
    x = []
    c = args[-1]
    for a in args:
        x.append(np.nonzero(np.in1d(a, c))[0])
    return x
If I add a d=np.array([1,0,4,3,0]) (it has only 1 match; what if there are no matches?)
then
return_equals(a,b,d,c)
produces:
[array([2, 4], dtype=int32),
array([1, 3], dtype=int32),
array([2], dtype=int32),
array([0, 1], dtype=int32)]
Since the lengths of the input and returned arrays can differ, you really can't vectorize the problem. That is, it takes some special gymnastics to perform the operation across all inputs at once. And if the number of arrays is small compared to their typical length, I wouldn't worry about speed. Iterating a few times is not expensive; it's iterating hundreds of times that gets expensive.
You could, of course, pass the keyword arguments on to in1d.
It's not clear what you are trying to do with the sorted_by parameter. Is that something that you could just as easily apply to the arrays before you pass them to this function?
List comprehension version of this iteration:
[np.nonzero(np.in1d(x,c))[0] for x in [a,b,d,c]]
I can imagine concatenating the arrays into one longer one, applying in1d, and then splitting it up into subarrays. There is np.split, but it requires that you tell it how many elements to put in each sublist. That means, somehow, determining how many matches there are for each argument. Doing that without looping could be tricky.
The pieces for this (that still need to be packed as function) are:
args=[a,b,d,c]
lens=[len(x) for x in args]
abc=np.concatenate(args)
C=np.cumsum(lens)
I=np.nonzero(np.in1d(abc,c))[0]
S=np.split(I,(2,4,5))
[S[0],S[1]-C[0],S[2]-C[1],S[3]-C[2]]
I
# array([ 2, 4, 6, 8, 12, 15, 16], dtype=int32)
C
# array([ 5, 10, 15, 17], dtype=int32)
The (2, 4, 5) are the cumulative counts of elements of I that fall before successive values of C, i.e. the points at which I has to be cut so that each piece holds the matches for one of a, b, ...
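Putting those pieces together, a sketch that computes the split points with np.searchsorted instead of hard-coding (2,4,5):
import numpy as np

def return_equals(*args):
    c = args[-1]                         # match against the last array, as above
    lens = np.array([len(x) for x in args])
    C = np.cumsum(lens)
    abc = np.concatenate(args)
    I = np.nonzero(np.in1d(abc, c))[0]
    split_points = np.searchsorted(I, C[:-1])
    offsets = np.append(0, C[:-1])
    return [s - off for s, off in zip(np.split(I, split_points), offsets)]

return_equals(a, b, d, c)
# [array([2, 4]), array([1, 3]), array([2]), array([0, 1])]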
Consider the following NumPy array:
a = np.array([[1,4], [2,1],(3,10),(4,8)])
This gives an array that looks like the following:
array([[ 1, 4],
[ 2, 1],
[ 3, 10],
[ 4, 8]])
What I'm trying to do is find the minimum value of the second column (which in this case is 1), and then report the other value of that pair (in this case 2). I've tried using something like argmin, but that gets tripped up by the 1 in the first column.
Is there a way to do this easily? I've also considered sorting the array, but I can't seem to get that to work in a way that keeps the pairs together. The data is being generated by a loop like the following, so if there's an easier way to do this that isn't a numpy array, I'd take that as an answer too:
results = np.zeros((100,2))
# Loop over search range, change kappa each time
for i in range(100):
    results[i,0] = function1(x)
    results[i,1] = function2(y)
How about
a[np.argmin(a[:, 1]), 0]
Break-down
a. Grab the second column
>>> a[:, 1]
array([ 4, 1, 10, 8])
b. Get the index of the minimum element in the second column
>>> np.argmin(a[:, 1])
1
c. Index a with that to get the corresponding row
>>> a[np.argmin(a[:, 1])]
array([2, 1])
d. And take the first element
>>> a[np.argmin(a[:, 1]), 0]
2
Using np.argmin is probably the best way to tackle this. To do it in pure python, you could use:
min(tuple(r[::-1]) for r in a)[::-1]
(each row is reversed so that min compares on the second column first, and the winning pair is reversed back at the end).
I've been writing a program to brute-force check a sequence of numbers looking for Euler bricks, but the method I came up with involves a triple loop. Since nested Python loops are notoriously slow, I was wondering if there was a better way to create the array of values I need using numpy.
# x = max side length of brick. User input.
for t in range(3,x):
    a=[];b=[];c=[];
    for u in range(2,t):
        for v in range(1,u):
            a.append(t)
            b.append(u)
            c.append(v)
    a=np.array(a)
    b=np.array(b)
    c=np.array(c)
    ...
Is there a better way to generate the array of values using numpy commands?
Thanks.
Example:
If x=10, when t=3 I want to get:
a=[3]
b=[2]
c=[1]
the first time through the loop. After that, when t=4:
a=[4, 4, 4]
b=[2, 3, 3]
c=[1, 1, 2]
The third time (t=5) I want:
a=[5, 5, 5, 5, 5, 5]
b=[2, 3, 3, 4, 4, 4]
c=[1, 1, 2, 1, 2, 3]
and so on, up to max side lengths around 5000 or so.
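For reference, a sketch of a fully vectorized way to build a, b, c for a single t, using np.tril_indices (sides_for is just a hypothetical helper name; the indices are shifted by 1 so b starts at 2 and c at 1):
import numpy as np

def sides_for(t):
    u, v = np.tril_indices(t - 1, -1)
    b = u + 1                 # 2 .. t-1
    c = v + 1                 # 1 .. b-1
    a = np.full(b.size, t)
    return a, b, c

sides_for(5)
# (array([5, 5, 5, 5, 5, 5]), array([2, 3, 3, 4, 4, 4]), array([1, 1, 2, 1, 2, 3]))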
EDIT: Solution
# (assumes "from numpy import *", given the bare array/arange/empty/hstack calls)
a=array(3)
b=array(2)
c=array(1)
for i in range(4,x): #Removing the (3,2,1) check from code does not affect results.
    foo=arange(1,i-1)
    foo2=empty(len(foo))
    foo2.fill(i-1)
    c=hstack((c,foo))
    b=hstack((b,foo2))
    a=empty(len(b))
    a.fill(i)
    ...
Works many times faster now. Thanks all.
Try to use .empty and .fill (http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.fill.html)
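A minimal illustration of what .empty plus .fill looks like (building the a array for one value of t; n is the number of (u, v) pairs for that t):
import numpy as np

t = 5
n = (t - 1) * (t - 2) // 2   # pairs with 1 <= v < u < t
a = np.empty(n, dtype=int)
a.fill(t)                    # a is now array([5, 5, 5, 5, 5, 5])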
There are a couple of things which could help, but probably only for large values of x. For starters, use xrange instead of range (on Python 2); that saves creating a list you never need. You could also create empty numpy arrays of the correct length and fill them with the values as you go, instead of appending to a list and then converting it into a numpy array.
I believe this code will work (no python access right this second):
for t in xrange(3, x):
    size = (t - 1) * (t - 2) // 2   # number of (u, v) pairs with 1 <= v < u < t
    a = np.zeros(size)
    b = np.zeros(size)
    c = np.zeros(size)
    idx = 0
    for u in xrange(2,t):
        for v in xrange(1,u):
            a[idx] = t
            b[idx] = u
            c[idx] = v
            idx += 1