numpy: ndarray index with repeated values [duplicate]

I have an unsorted array of indexes:
i = np.array([1,5,2,6,4,3,6,7,4,3,2])
I also have an array of values of the same length:
v = np.array([2,5,2,3,4,1,2,1,6,4,2])
I have an array of zeros of the desired size:
d = np.zeros(10)
Now I want to add the values of v to the elements of d at the indexes given in i.
If I did it in plain Python, I would do it like this:
for index, value in enumerate(v):
    idx = i[index]
    d[idx] += v[index]
It is ugly and inefficient. How can I change it?

np.add.at(d, i, v)
You'd think d[i] += v would work, but when the index array contains repeats, only one of the additions to each repeated cell takes effect, because the operation is buffered. The ufunc.at method performs unbuffered in-place accumulation and avoids that problem.
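For example, a quick illustrative sketch with the arrays from the question shows the difference:
import numpy as np

i = np.array([1, 5, 2, 6, 4, 3, 6, 7, 4, 3, 2])
v = np.array([2, 5, 2, 3, 4, 1, 2, 1, 6, 4, 2])

d_bad = np.zeros(10)
d_bad[i] += v            # buffered: repeated indexes keep only one contribution each

d_good = np.zeros(10)
np.add.at(d_good, i, v)  # unbuffered: every (index, value) pair is accumulated

print(d_bad)   # positions 2, 3, 4 and 6 are missing contributions
print(d_good)  # position 2 gets 2+2, 3 gets 1+4, 4 gets 4+6, 6 gets 3+2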

We can use np.bincount, which is pretty efficient for such accumulative weighted counting, so here's one approach with it -
counts = np.bincount(i,v)
d[:counts.size] = counts
Alternatively, using the minlength argument, for the generic case where d could be any array and we want to add into it -
d += np.bincount(i,v,minlength=d.size).astype(d.dtype, copy=False)
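As a quick illustrative check on the arrays from the question, this fills d in one shot:
import numpy as np

i = np.array([1, 5, 2, 6, 4, 3, 6, 7, 4, 3, 2])
v = np.array([2, 5, 2, 3, 4, 1, 2, 1, 6, 4, 2])
d = np.zeros(10)

d += np.bincount(i, v, minlength=d.size).astype(d.dtype, copy=False)
print(d)  # same result as np.add.at(d, i, v) on a zeroed d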
Runtime tests
This section compares the np.add.at based approach listed in the other answer with the np.bincount based one listed earlier in this answer.
In [61]: def bincount_based(d,i,v):
    ...:     counts = np.bincount(i,v)
    ...:     d[:counts.size] = counts
    ...:
    ...: def add_at_based(d,i,v):
    ...:     np.add.at(d, i, v)
    ...:
In [62]: # Inputs (random numbers)
...: N = 10000
...: i = np.random.randint(0,1000,(N))
...: v = np.random.randint(0,1000,(N))
...:
...: # Setup output arrays for two approaches
...: M = 12000
...: d1 = np.zeros(M)
...: d2 = np.zeros(M)
...:
In [63]: bincount_based(d1,i,v) # Run approaches
...: add_at_based(d2,i,v)
...:
In [64]: np.allclose(d1,d2) # Verify outputs
Out[64]: True
In [67]: # Setup output arrays for two approaches again for timing
...: M = 12000
...: d1 = np.zeros(M)
...: d2 = np.zeros(M)
...:
In [68]: %timeit add_at_based(d2,i,v)
1000 loops, best of 3: 1.83 ms per loop
In [69]: %timeit bincount_based(d1,i,v)
10000 loops, best of 3: 52.7 µs per loop

Related

Summing array from indexes and values from other array: better way to do it in numpy? [duplicate]


How do I split integers within a list into single digits only?

Let's say I have something like this:
list(range(9, 12))
Which gives me a list:
[9,10,11]
However I want it to be like:
[9,1,0,1,1]
Which splits every integer into single digits. Is there any way of achieving this without sacrificing too much performance? Or is there a way of generating lists like these in the first place?
You can build the final result efficiently, without building one large string or many small intermediate strings, using itertools.chain.from_iterable.
In [18]: list(map(int, chain.from_iterable(map(str, range(9, 12)))))
Out[18]: [9, 1, 0, 1, 1]
In [12]: %%timeit
...: list(map(int, chain.from_iterable(map(str, range(9, 20)))))
...:
100000 loops, best of 3: 8.19 µs per loop
In [13]: %%timeit
...: [int(i) for i in ''.join(map(str, range(9, 20)))]
...:
100000 loops, best of 3: 9.15 µs per loop
In [14]: %%timeit
...: [int(x) for i in range(9, 20) for x in str(i)]
...:
100000 loops, best of 3: 9.92 µs per loop
Timings scale with input size. The itertools version also uses memory efficiently, although it is marginally slower than the str.join version when the latter is used with list(map(int, ...)):
In [15]: %%timeit
...: list(map(int, chain.from_iterable(map(str, range(9, 200)))))
...:
10000 loops, best of 3: 138 µs per loop
In [16]: %%timeit
...: [int(i) for i in ''.join(map(str, range(9, 200)))]
...:
10000 loops, best of 3: 159 µs per loop
In [17]: %%timeit
...: [int(x) for i in range(9, 200) for x in str(i)]
...:
10000 loops, best of 3: 182 µs per loop
In [18]: %%timeit
...: list(map(int, ''.join(map(str, range(9, 200)))))
...:
10000 loops, best of 3: 130 µs per loop
The simplest way is:
>>> [int(i) for i in range(9,12) for i in str(i)]
[9, 1, 0, 1, 1]
>>>
Convert the integers to strings, then iterate over the characters of each string and convert them back to ints.
li = range(9,12)
digitlist = [int(d) for number in li for d in str(number)]
Output:
[9,1,0,1,1]
I've investigated a little how performant I can make this. The first function I wrote was naive_single_digits, which uses the str approach with a pretty efficient list comprehension.
def naive_single_digits(l):
    return [int(c) for n in l
            for c in str(n)]
As you can see, this approach works:
In [2]: naive_single_digits(range(9, 15))
Out[2]: [9, 1, 0, 1, 1, 1, 2, 1, 3, 1, 4]
However, I thought that it would surely be unnecessary to always build a str object for each item in the list - all we actually need is a base conversion to digits. Out of laziness, I copied this function from here. I've optimised it a bit by specialising it to base 10.
def base10(n):
    if n == 0:
        return [0]
    digits = []
    while n:
        digits.append(n % 10)
        n //= 10
    return digits[::-1]
Using this, I made
def arithmetic_single_digits(l):
    return [i for n in l
            for i in base10(n)]
which also behaves correctly:
In [3]: arithmetic_single_digits(range(9, 15))
Out[3]: [9, 1, 0, 1, 1, 1, 2, 1, 3, 1, 4]
Now to time it. I've also tested against one other answer (full disclosure: I modified it a bit to work in Python2, but that shouldn't have affected the performance much)
In [11]: %timeit -n 10 naive_single_digits(range(100000))
10 loops, best of 3: 173 ms per loop
In [10]: %timeit -n 10 list(map(int, itertools.chain(*map(str, range(100000)))))
10 loops, best of 3: 154 ms per loop
In [12]: %timeit arithmetic_single_digits(range(100000))
10 loops, best of 3: 93.3 ms per loop
As you can see, arithmetic_single_digits is actually somewhat faster, although this is at the cost of more code and possibly less clarity. I've tested against ridiculously large inputs, so you can see a difference in performance - at any kind of reasonable scale, every answer here will be blazingly fast. Note that Python's integer arithmetic is probably relatively slow, as it doesn't use a primitive integer type. If this were to be implemented in C, I'd expect my approach to be a bit faster.
Comparing this to viblo's answer, using (pure) Python 3 (to my shame I haven't installed ipython for python 3):
print(timeit.timeit("digits(range(1, 100000))", number=10, globals=globals()))
print(timeit.timeit("arithmetic_single_digits(range(1, 100000))", number=10, globals=globals()))
This has the output of:
3.5284318959747907
0.806847038998967
My approach is quite a bit faster, presumably because I'm purely using integer arithmetic.
Another way to write an arithmetic solution. Compared to Izaak van Dongen's solution this doesn't use a while loop but calculates upfront how many iterations it needs in the list comprehension/loop.
import itertools, math

def digits(ns):
    return list(itertools.chain.from_iterable(
        [
            [
                (abs(n) - (abs(n) // 10**x) * 10**x) // 10**(x - 1)
                for x
                in range(1 + math.floor(math.log10(abs(n) if n != 0 else 1)), 0, -1)
            ]
            for n in ns
        ]
    ))

digits([-11, -10, -9, 0, 9, 10, 11])
Turn it into a string, then back into a list (converting each character back to an int):
lambda x: list(map(int, ''.join(str(e) for e in x)))
You can also do it with the map function:
a = range(9, 12)
res = []
b = [map(int, str(i)) for i in a]
for i in b:
    res.extend(i)
print(res)
Here is how I did it:
ls = range(9, 12)
lsNew = []
length = len(ls)
for i in range(length):
    item = ls[i]
    string = str(item)
    if len(string) > 1:
        split = [int(d) for d in string]  # keep the digits as ints rather than single-character strings
        lsNew = lsNew + split
    else:
        lsNew.append(item)
ls = lsNew
print(ls)
def breakall(L):
    if L == []:
        return []
    elif L[0] < 10:
        return [L[0]] + breakall(L[1:])
    else:
        return breakall([L[0]//10]) + [L[0] % 10] + breakall(L[1:])

print(breakall([9,10,12]))
-->
[9, 1, 0, 1, 2]

Looping and Searching in Numpy Array

I need to loop over a numpy array and then do the following search. The following takes almost 60 seconds for arrays (npArray1 and npArray2 in the example below) with around 300K values.
In other words, I am looking for the index of the first occurrence in npArray2 for every value of npArray1.
for id in np.nditer(npArray1):
    newId = (np.where(npArray2==id))[0][0]
Is there any way I can make the above faster using numpy? I need to run the script on much bigger arrays (50M values). Please note that my two numpy arrays above, npArray1 and npArray2, are not necessarily the same size, but they are both 1-d.
Thanks a lot for your help,
The function np.unique will do much of the work for you:
npArray2 = np.random.randint(100, None, (1000,))  # 1000-long vector of ints between 0 and 99, so lots of repeats
vals, idxs = np.unique(npArray2, return_index=True)  # each unique value AND the index of its first appearance
for val in npArray1:
    newId = idxs[vals == val][0]
vals is an array containing the unique values in npArray2, while idxs gives the index of the first appearance of each value in npArray2. Searching in vals should be much faster than searching in npArray2 because it is smaller and sorted.
You can speed up the search further by taking advantage of the fact that vals is sorted:
import bisect  # we can use binary search since vals is sorted

for val in npArray1:
    newId = idxs[bisect.bisect_left(vals, val)]
Assuming the input arrays contain unique values, you can use np.searchsorted with its optional sorter option for a vectorized solution, like so -
arr2_sortidx = npArray2.argsort()
idx = np.searchsorted(npArray2,npArray1,sorter=arr2_sortidx)
out1 = arr2_sortidx[idx]
Sample run to verify output -
In [154]: npArray1
Out[154]: array([77, 19, 0, 69])
In [155]: npArray2
Out[155]: array([ 8, 33, 12, 19, 77, 30, 81, 69, 20, 0])
In [156]: out = np.empty(npArray1.size,dtype=int)
     ...: for i,id in np.ndenumerate(npArray1):
     ...:     out[i] = (np.where(npArray2==id))[0][0]
     ...:
In [157]: arr2_sortidx = npArray2.argsort()
...: idx = np.searchsorted(npArray2,npArray1,sorter=arr2_sortidx)
...: out1 = arr2_sortidx[idx]
...:
In [158]: out
Out[158]: array([4, 3, 9, 7])
In [159]: out1
Out[159]: array([4, 3, 9, 7])
Runtime test -
In [175]: def original_app(npArray1,npArray2):
     ...:     out = np.empty(npArray1.size,dtype=int)
     ...:     for i,id in np.ndenumerate(npArray1):
     ...:         out[i] = (np.where(npArray2==id))[0][0]
     ...:     return out
     ...:
     ...: def searchsorted_app(npArray1,npArray2):
     ...:     arr2_sortidx = npArray2.argsort()
     ...:     idx = np.searchsorted(npArray2,npArray1,sorter=arr2_sortidx)
     ...:     return arr2_sortidx[idx]
     ...:
In [176]: # Setup inputs
...: M,N = 50000,40000 # npArray2 and npArray1 sizes respectively
...: maxn = 200000
...: npArray2 = np.unique(np.random.randint(0,maxn,(M)))
...: npArray2 = npArray2[np.random.permutation(npArray2.size)]
...: npArray1 = npArray2[np.random.permutation(npArray2.size)[:N]]
...:
In [177]: out1 = original_app(npArray1,npArray2)
In [178]: out2 = searchsorted_app(npArray1,npArray2)
In [179]: np.allclose(out1,out2)
Out[179]: True
In [180]: %timeit original_app(npArray1,npArray2)
1 loops, best of 3: 3.14 s per loop
In [181]: %timeit searchsorted_app(npArray1,npArray2)
100 loops, best of 3: 17.4 ms per loop
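The searchsorted approach above assumes npArray2 has unique values. If npArray2 can contain repeats and the index of the first occurrence is still wanted, one possible sketch (assuming every value of npArray1 does occur in npArray2) combines np.unique with return_index and np.searchsorted; since vals comes back sorted, no separate argsort is needed:
import numpy as np

def first_occurrence_app(npArray1, npArray2):
    vals, first_idx = np.unique(npArray2, return_index=True)  # unique values + index of their first appearance
    pos = np.searchsorted(vals, npArray1)                     # vectorized binary search into the sorted vals
    return first_idx[pos]

npArray2 = np.array([8, 33, 12, 19, 77, 30, 19, 77, 20, 0])  # contains repeats
npArray1 = np.array([77, 19, 0])
print(first_occurrence_app(npArray1, npArray2))  # [4 3 9]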
In the task you specified you have to iterate over the array one way or another, so the most you can hope for is a considerable performance improvement without changing your algorithm too much. This is where numba might be of great help:
import numpy as np
from numba import jit

@jit
def numba_iter(npa1, npa2):
    for id in np.nditer(npa1):
        newId = (np.where(npa2==id))[0][0]
This simple approach might make your program much faster. Look at some examples and benchmarks here.
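Note that the jitted function above only computes newId and discards it, just like the original loop. A fuller hypothetical sketch (assuming numba is installed and every value of npa1 occurs in npa2) that collects the first-match indexes into an output array could look like this:
import numpy as np
from numba import njit

@njit
def numba_first_index(npa1, npa2):
    out = np.empty(npa1.size, dtype=np.int64)
    for k in range(npa1.size):
        for j in range(npa2.size):
            if npa2[j] == npa1[k]:  # stop at the first match
                out[k] = j
                break
    return out

# idx = numba_first_index(npArray1, npArray2)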

NumPy: Calculate mean of certain elements in array

Assuming a 1-d array, is it possible to calculate the average over given groups of different size without looping? Instead of
avgs = [One_d_array[groups[i]].mean() for i in range(len(groups))]
Something like
avgs = np.mean(One_d_array, groups)
Basically I want to do this:
M = np.arange(10000)
np.random.shuffle(M)
M.resize(100,100)
groups = np.random.randint(1, 10, 100)

def means(M, groups):
    means = []
    for i, label in enumerate(groups):
        means.extend([M[i][groups == j].mean() for j in set(p).difference([label])])
    return means
This runs at
%timeit means(M, groups)
100 loops, best of 3: 12.2 ms per loop
A speed-up of 10 times or so would already be great.
Whether you see a loop or not, there is a loop.
Here's one way, but the loop is simply hidden in the call to map:
In [10]: import numpy as np
In [11]: groups = [[1,2],[3,4,5]]
In [12]: map(np.mean, groups)
Out[12]: [1.5, 4.0]
Another hidden loop is the use of np.vectorize:
>>> x = np.array([1,2,3,4,5])
>>> groups = [[0,1,2], [3,4]]
>>> np.vectorize(lambda group: np.mean(x[group]), otypes=[float])(groups)
array([ 2. , 4.5])
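If each element carries an integer group label (similar in spirit to the question's groups array), a loop-free sketch can reuse the weighted np.bincount idea from the first question: the per-label sums divided by the per-label counts give the means. The x and labels names below are purely illustrative:
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
labels = np.array([0, 0, 0, 1, 1])           # integer group label per element

group_sums = np.bincount(labels, weights=x)  # sum of x within each label
group_sizes = np.bincount(labels)            # number of elements per label
print(group_sums / group_sizes)              # [2.  4.5]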

algorithm to compare two lists and get same elements in python

I have two lists, which have some common elements in them:
p = [('link1/d/b/c', 'target1/d/b/c'), ('link2/a/g/c', 'target2/a/g/c'), ..., ('linkn/b/b/f', 'targetn/b/b/f')]
q = [['target1/d/b/c', 'target1', 123, 334], ['targetn/b/b/f', 'targetn', 23, 64], ... ,['targetx/f/f/f', 'targetx', 999, 888]]
I'm trying to compare them and find the common elements, and then do some job with the result:
do_job('target1/d/b/c', 'target1', 123, 334, 'link1/d/b/c')
For now I'm using a simple and very slow algorithm:
for item in p:
    link = item[0]
    target = item[1]
    for item2 in q:
        target2 = item2[0]
        if target2 == target:
            do_some_job(...)
I thought that I need to compare these two lists and create one list which will contain all the elements, e.g.:
pq = [['target1/d/b/c', 'target1', 123, 334, 'link1/d/b/c'], ..., ['targetn/b/b/f', 'targetn', 23, 64, 'linkn/b/b/f']]
and then call do_some_job(pq) instead of calling it each time I find a matching element.
How can I achieve this?
Best regards
Use chain() to flatten the two lists, and then use set() and intersection() to get the common elements.
In [78]: from itertools import chain
In [79]: p
Out[79]:
[('link1/d/b/c', 'target1/d/b/c'),
('link2/a/g/c', 'target2/a/g/c'),
('linkn/b/b/f', 'targetn/b/b/f')]
In [80]: q
Out[80]:
[['target1/d/b/c', 'target1', 123, 334],
['targetn/b/b/f', 'targetn', 23, 64],
['targetx/f/f/f', 'targetx', 999, 888]]
In [81]: set(chain(*p)).intersection(set(chain(*q)))
Out[81]: set(['target1/d/b/c', 'targetn/b/b/f'])
Or use a list comprehension with short-circuiting:
In [86]: [j for i in p for j in i if j in (z for y in q for z in y)]
Out[86]: ['target1/d/b/c', 'targetn/b/b/f']
or using any():
In [87]: [j for i in p for j in i if any (j==z for y in q for z in y)]
Out[87]: ['target1/d/b/c', 'targetn/b/b/f']
timeit:
In [93]: %timeit set(chain(*p)).intersection(set(chain(*q)))
100000 loops, best of 3: 7.38 us per loop ## winner
In [94]: %timeit [j for i in p for j in i if j in (z for y in q for z in y)]
10000 loops, best of 3: 24.9 us per loop
In [95]: %timeit [j for i in p for j in i if any (j==z for y in q for z in y)]
10000 loops, best of 3: 27.4 us per loop
In [97]: %timeit [x for x in chain(*p) if x in chain(*q)]
10000 loops, best of 3: 12.6 us per loop
You should probably use a dictionary:
target_to_link = dict((v,k) for (k,v) in p)
for item in q:
    args = item + [target_to_link[item[0]]]
    do_some_job(*args)
The target_to_link dictionary gives you the corresponding link from your target. Just make sure that you don't have the same target appearing with several different links...
In the for loop, we just create a temporary list of arguments args that combines your item (e.g., ['target1/d/b/c', 'target1', 123, 334]) with the corresponding link, and we use the function(*args) syntax...
If you need to be looping on p instead, you can construct a dictionary like
target_to_args = dict((k[0],k[1:]) for k in q)
then do something like
for (link, target) in p:
    args = [target] + target_to_args[target] + [link]
    do_some_job(*args)
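Putting the first dictionary approach together with the sample data, a small hypothetical end-to-end run (with do_some_job stubbed out, and a membership test added because q may contain targets that have no link in p) could look like this:
p = [('link1/d/b/c', 'target1/d/b/c'), ('linkn/b/b/f', 'targetn/b/b/f')]
q = [['target1/d/b/c', 'target1', 123, 334],
     ['targetn/b/b/f', 'targetn', 23, 64],
     ['targetx/f/f/f', 'targetx', 999, 888]]

def do_some_job(*args):
    print(args)                            # stand-in for the real work

target_to_link = dict((v, k) for (k, v) in p)
for item in q:
    if item[0] in target_to_link:          # skip targets without a matching link
        do_some_job(*(item + [target_to_link[item[0]]]))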
A list comprehension with chain should work:
[x for x in chain(*p) if x in chain(*q)]
