Optimizing numpy concatenate operations for very large data sets

Optimizing numpy concatenate operations for very large data sets - python

I have a dictionary, with a very large number of keys (~300k and growing) and as values it has sets which also have a large number of items (~20k).
dictionary = {
1: {1, 2, 3},
2: {3, 4},
3: {5, 6},
4: {1, 5, 12, 13},
5: set()
}
What I want to achieve is create two arrays:
keys = [1 1 1 2 2 3 3 4 4 4 4]
items = [1 2 3 3 4 5 6 1 5 12 13]
Which basically represent a mapping of each item in each set along with its corresponding key.
I tried using numpy for this job, but it still takes a very long time and I want to know if it can be optimized.
numpy code:
keys = np.concatenate(list(map(lambda x: np.repeat(x[0], len(x[1])), dictionary.items())))
items = np.concatenate(list(map(lambda x: list(x), dictionary.values())))
keys = np.array(keys, dtype=np.uint32)
items = np.array(items, dtype=np.uint16)
return keys, items
The second part is an attempt to try to reduce the memory footprint of those variables to account for their respective data types. But I know they will still default to 64bit variables in the first two operations (before applying the dtype change), so the memory will get allocated and I might run out of RAM.

not sure if it will perform much better but straight forward way to do it is like this
import numpy as np
keys = np.array(list(dictionary.keys()), dtype=np.uint32).repeat([len(s) for s in dictionary.values()])
values = np.concatenate([np.array(list(s), np.uint16) for s in dictionary.values()])
display(keys)
display(values)

For this small sample, a pure list version is considerably faster than the numpy one:
In [14]: timeit list(itertools.chain.from_iterable([[item[0]]*len(item[1]) for item in dictionary.items()]))
2.71 µs ± 18.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [15]: timeit np.concatenate(list(map(lambda x: np.repeat(x[0], len(x[1])), dictionary.items())))
52.2 µs ± 284 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
and
In [24]: list(itertools.chain.from_iterable(dictionary.values()))
Out[24]: [1, 2, 3, 3, 4, 5, 6, 1, 13, 12, 5]
In [25]: timeit list(itertools.chain.from_iterable(dictionary.values()))
876 ns ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [26]: timeit np.concatenate(list(map(lambda x: list(x), dictionary.values())))
13.8 µs ± 32.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
And Paul Panzer's version:
In [41]: timeit np.fromiter(itertools.chain.from_iterable(dictionary.values()),'int32')
3.69 µs ± 9.07 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

It's probably better to use np.fromiter here. It is certainly easier on the memory as it avoids creating all those temporaries.
Timings:
import numpy as np
import itertools as it
from simple_benchmark import BenchmarkBuilder
B = BenchmarkBuilder()
#B.add_function()
def pp(a):
szs = np.fromiter(map(len,a.values()),int,len(a))
ks = np.fromiter(a.keys(),np.uint32,len(a)).repeat(szs)
vls = np.fromiter(it.chain.from_iterable(a.values()),np.uint16,ks.size)
return ks,vls
#B.add_function()
def OP(a):
keys = np.concatenate(list(map(lambda x: np.repeat(x[0], len(x[1])), a.items())))
items = np.concatenate(list(map(list, a.values())))
return keys, items
#B.add_function()
def DevKhadka(a):
keys = np.array(list(a.keys()), dtype=np.uint32).repeat([len(s) for s in a.values()])
values = np.concatenate([np.array(list(s), np.uint16) for s in a.values()])
return keys,values
#B.add_function()
def hpaulj(a):
ks = list(it.chain.from_iterable([[item[0]]*len(item[1]) for item in a.items()]))
vls = list(it.chain.from_iterable(a.values()))
return ks,vls
#B.add_arguments('total no items')
def argument_provider():
for exp in range(1,12):
sz = 2**exp
a = {j:set(np.random.randint(1,2**16,np.random.randint(1,sz)).tolist())
for j in range(1,10*sz)}
yield sum(map(len,a.values())),a
r = B.run()
r.plot()
import pylab
pylab.savefig('dct2np.png')

Related

Python: get maximum occurrence in array

I implemented codes to try to get maximum occurrence in numpy array. I was satisfactory using numba, but got limitations. I wonder whether it can be improved to a general case.
numba implementation
import numba as nb
import numpy as np
import collections
#nb.njit("int64(int64[:])")
def max_count_unique_num(x):
"""
Counts maximum number of unique integer in x.
Args:
x (numpy array): Integer array.
Returns:
Int
"""
# get maximum value
m = x[0]
for v in x:
if v > m:
m = v
if m == 0:
return x.size
# count each unique value
num = np.zeros(m + 1, dtype=x.dtype)
for k in x:
num[k] += 1
# maximum count
m = 0
for k in num:
if k > m:
m = k
return m
For comparisons, I also implemented numpy's unique and collections.Counter
def np_unique(x):
""" Counts maximum occurrence using numpy's unique. """
ux, uc = np.unique(x, return_counts=True)
return uc.max()
def counter(x):
""" Counts maximum occurrence using collections.Counter. """
counts = collections.Counter(x)
return max(counts.values())
timeit
Edit: Add np.bincount for additional comparison, as suggested by #MechanicPig.
In [1]: x = np.random.randint(0, 2000, size=30000).astype(np.int64)
In [2]: %timeit max_count_unique_num(x)
30 µs ± 387 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [3]: %timeit np_unique(x)
1.14 ms ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [4]: %timeit counter(x)
2.68 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [6]: %timeit counter(x)
3.07 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: %timeit np_unique(x)
1.3 ms ± 7.35 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [8]: %timeit max_count_unique_num(x)
490 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [9]: x = np.random.randint(0, 2000, size=30000).astype(np.int64)
In [10]: %timeit np.bincount(x).max()
32.3 µs ± 250 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [11]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [12]: %timeit np.bincount(x).max()
830 µs ± 6.09 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
The limitations of numba implementation are quite obvious: efficiency only when all values in x are small positive int and will be significantly reduced for very large int; not applicable to float and negative values.
Any way I can generalize the implementation and keep the speed?
Update
After checking the source code of np.unique, an implementation for general cases can be:
#nb.njit(["int64(int64[:])", "int64(float64[:])"])
def max_count_unique_num_2(x):
x.sort()
n = 0
k = 0
x0 = x[0]
for v in x:
if x0 == v:
k += 1
else:
if k > n:
n = k
k = 1
x0 = v
# for last item in x if it equals to previous one
if k > n:
n = k
return n
timeit
In [154]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [155]: %timeit max_count_unique_num(x)
519 µs ± 5.33 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [156]: %timeit np_unique(x)
1.3 ms ± 9.88 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [157]: %timeit max_count_unique_num_2(x)
240 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [158]: x = np.random.randint(0, 200000, size=300000).astype(np.int64)
In [159]: %timeit max_count_unique_num(x)
1.01 ms ± 7.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [160]: %timeit np_unique(x)
18.1 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [161]: %timeit max_count_unique_num_2(x)
3.58 ms ± 28.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So:
If large integer in x and the size is not large, max_count_unique_num_2 beats max_count_unique_num.
Both max_count_unique_num and max_count_unique_num_2 are significantly faster than np.unique.
Small modification on max_count_unique_num_2 can return the item that has maximum occurrence, even all items having same maximum occurrence.
max_count_unique_num_2 can even be accelerated if x is itself sorted by removing x.sort().

What if shortening your code:
#nb.njit("int64(int64[:])", fastmath=True)
def shortened(x):
num = np.zeros(x.max() + 1, dtype=x.dtype)
for k in x:
num[k] += 1
return num.max()
or paralleled:
#nb.njit("int64(int64[:])", parallel=True, fastmath=True)
def shortened_paralleled(x):
num = np.zeros(x.max() + 1, dtype=x.dtype)
for k in nb.prange(x.size):
num[x[k]] += 1
return num.max()
Parallelizing will beat for larger data sizes. Note that parallel will get different result in some runs and need to be cured if be possible.
For handling the floats (or negative values) using Numba:
#nb.njit("int8(float64[:])", fastmath=True)
def shortened_float(x):
num = np.zeros(x.size, dtype=np.int8)
for k in x:
for j in range(x.shape[0]):
if k == x[j]:
num[j] += 1
return num.max()
IMO, np.unique(x, return_counts=True)[1].max() is the best choice which handle both integers and floats in a very fast implementation. Numba can be faster for integers (it depends on the data sizes as larger data sizes weaker performance; AIK, it is due to looping instinct than arrays), but for floats the code must be optimized in terms of performance if it could; But I don't think that Numba can beat NumPy unique, particularly when we faced to large data.
Notes: np.bincount can handle just integers.

You can do that without using numpy too.
arr = [1,1,2,2,3,3,4,5,6,1,3,5,7,1]
counts = list(map(list(arr).count, set(arr)))
list(set(arr))[counts.index(max(counts))]
If you want to use numpy then try this,
arr = np.array([1,1,2,2,3,3,4,5,6,1,3,5,7,1])
uniques, counts = np.unique(arr, return_counts = True)
uniques[np.where(counts == counts.max())]
Both do the exact same job. To check which method is more efficient just do this,
time_i = time.time()
<arr declaration> # Creating a new array each iteration can cause the total time to increase which would be biased against the numpy method.
for i in range(10**5):
<method you want>
time_f = time.time()
When I ran this I got 0.39 seconds for the first method and 2.69 for the second one. So it's pretty safe to say that the first method is more efficient.

What I want to say is that your implementation is almost the same as numpy.bincount. If you want to make it universal, you can consider encoding the original data:
def encode(ar):
# Equivalent to numpy.unique(ar, return_inverse=True)[1] when ar.ndim == 1
flatten = ar.ravel()
perm = flatten.argsort()
sort = flatten[perm]
mask = np.concatenate(([False], sort[1:] != sort[:-1]))
encoded = np.empty(sort.shape, np.int64)
encoded[perm] = mask.cumsum()
encoded.shape = ar.shape
return encoded
def count_max(ar):
return max_count_unique_num(encode(ar))

How do I efficiently find which elements of a list are in another list?

I want to know which elements of list_1 are in list_2. I need the output as an ordered list of booleans. But I want to avoid for loops, because both lists have over 2 million elements.
This is what I have and it works, but it's too slow:
list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
booleans = []
for i in list_1:
booleans.append(i in list_2)
# booleans = [False, False, True, True, False, False]
I could split the list and use multithreading, but I would prefer a simpler solution if possible. I know some functions like sum() use vector operations. I am looking for something similar.
How can I make my code more efficient?

I thought it would be useful to actually time some of the solutions presented here on a larger sample input. For this input and on my machine, I find Cardstdani's approach to be the fastest, followed by the numpy isin() approach.
Setup 1
import random
list_1 = [random.randint(1, 10_000) for i in range(100_000)]
list_2 = [random.randint(1, 10_000) for i in range(100_000)]
Setup 2
list_1 = [random.randint(1, 10_000) for i in range(100_000)]
list_2 = [random.randint(10_001, 20_000) for i in range(100_000)]
Timings - ordered from fastest to slowest (setup 1).
Cardstdani - approach 1
I recommend converting Cardstdani's approach into a list comprehension (see this question for why list comprehensions are faster)
s = set(list_2)
booleans = [i in s for i in list_1]
# setup 1
6.01 ms ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
4.19 ms ± 27.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
No list comprehension
s = set(list_2)
booleans = []
for i in list_1:
booleans.append(i in s)
# setup 1
7.28 ms ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
5.87 ms ± 8.19 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Cardstdani - approach 2 (with an assist from Timus)
common = set(list_1) & set(list_2)
booleans = [item in common for item in list_1]
# setup 1
8.3 ms ± 34.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
6.01 ms ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Using the set intersection method
common = set(list_1).intersection(list_2)
booleans = [item in common for item in list_1]
# setup 1
10.1 ms ± 29.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
4.82 ms ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
numpy approach (crissal)
a1 = np.array(list_1)
a2 = np.array(list_2)
a = np.isin(a1, a2)
# setup 1
18.6 ms ± 74.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
18.2 ms ± 47.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2 (assuming list_1, list_2 already numpy arrays)
10.3 ms ± 73.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
list comprehension
l = [i in list_2 for i in list_1]
# setup 1
4.85 s ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# setup 2
48.6 s ± 823 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Sharim - approach 1
booleans = list(map(lambda e: e in list_2, list_1))
# setup 1
4.88 s ± 24.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# setup 2
48 s ± 389 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using the __contains__ method
booleans = list(map(list_2.__contains__, list_1))
# setup 1
4.87 s ± 5.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# setup 2
48.2 s ± 486 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Sharim - approach 2
set_2 = set(list_2)
booleans = list(map(lambda e: set_2 != set_2 - {e}, list_1))
# setup 1
5.46 s ± 56.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# setup 2
11.1 s ± 75.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Varying the length of the input
Employing the following setup
import random
list_1 = [random.randint(1, n) for i in range(n)]
list_2 = [random.randint(1, n) for i in range(n)]
and varying n in [2 ** k for k in range(18)]:
Employing the following setup
import random
list_1 = [random.randint(1, n ** 2) for i in range(n)]
list_2 = [random.randint(1, n ** 2) for i in range(n)]
and varying n in [2 ** k for k in range(18)], we obtain similar results:
Employing the following setup
list_1 = list(range(n))
list_2 = list(range(n, 2 * n))
and varying n in [2 ** k for k in range(18)]:
Employing the following setup
import random
list_1 = [random.randint(1, n) for i in range(10 * n)]
list_2 = [random.randint(1, n) for i in range(10 * n)]
and varying n in [2 ** k for k in range(18)]:

You can take advantage of the O(1) in operator complexity for the set() function to make your for loop more efficient, so your final algorithm would run in O(n) time, instead of O(n*n):
list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
s = set(list_2)
booleans = []
for i in list_1:
booleans.append(i in s)
print(booleans)
It is even faster as a list comprehension:
s = set(list_2)
booleans = [i in s for i in list_1]
If you only want to know the elements, you can use an intersection of sets like that, which will be an efficient solution due to the use of set() function, already optimized by other Python engineers:
list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
print(set(list_1).intersection(set(list_2)))
Output:
{1, 2}
Also, to provide a list format output, you can turn your resulting set into a list with list() function:
print(list(set(list_1).intersection(set(list_2))))

If you want to use a vector approach you can also use Numpy isin. It's not the fastest method, as demonstrated by oda's excellent post, but it's definitely an alternative to consider.
import numpy as np
list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
a1 = np.array(list_1)
a2 = np.array(list_2)
np.isin(a1, a2)
# array([False, False, True, True, False, False])

You can use the map function.
Inside map I use the lambda function. If you are not familiar with the lambda function then you can check this out.
list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
booleans = list(map(lambda e:e in list_2,iter(list_1)))
print(booleans)
output
[False, False, True, True, False, False]
However, if you want the only elements which are not the same then instead of a map function you can use the filter function with the same code.
list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
new_lst = list(filter(lambda e:e in list_2,iter(list_1)))# edited instead of map use filter.
print(new_lst)
output
[1, 2]
Edited
I am removing the in statement from the code because in also acts as a loop. I am checking using the timeit module.
you can use this code for the list containing True and False.
This way is fastest then above one.
list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
set_2 = set(list_2)
booleans = list(map(lambda e:set_2!=set_2-{e},iter(list_1)))
print(booleans)
output
[False, False, True, True, False, False]
This one is for the list containing the elements.
list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
set_2 = set(list_2)
booleans = list(filter(lambda e:set_2!=set_2-{e},iter(list_1))) # edited instead of map use filter
print(booleans)
output
[1,2]
Because OP don't want to use lambda function then this.
list_1 = [0,0,1,2,0,0]*100000
list_2 = [1,2,3,4,5,6]*100000
set_2 = set(list_2)
def func():
return set_2!=set_2-{e}
booleans = list(map(func,iter(list_1)))
I know my way isn't the best way to this answer this because I am never using NumPy much.

It's probably simpler to just use the built-in set intersection method, but if you have lots of lists that you're comparing, it might be faster to sort the lists. Sorting the list is n ln n, but once you have them sorted, you can compare them in linear time by checking whether the elements match, and if they don't, advance to the next item in the list whose current element is smaller.

Use set() to get a list of unique items in each list
list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
booleans = []
set_1 = set(list_1)
set_2 = set(list_2)
if(set_1 & set_2):
print(set_1 & set_2)
else:
print("No common elements")
Output:
{1, 2}

If you know the values are non-negative and the maximum value is much smaller than the length of the list, then using numpy's bincount might be a good alternative for using a set.
np.bincount(list_1).astype(bool)[list_2]
If list_1 and list_2 happen to be numpy arrays, this can even be a lot faster than the set + list-comprehension solution. (In my test 263 µs vs 7.37 ms; but if they're python lists, it's slightly slower than the set solution, with 8.07 ms)

Spybug96's method will work best and fastest. If you want to get an indented object with the common elements of the two sets you can use the tuple() function on the final set:
a = set(range(1, 6))
b = set(range(3, 9))
c = a & b
print(tuple(c))

How can I transpose an array of values into a matching array of indices using numpy vectorization?

I have a numpy array of sample pairs (2-D) and an array of samples (1-D). I want to convert the sample pairs to a matching array (i.e. 2-D) representing the indices of the sample array. Is there a faster solution than what I have already employed?
import numpy as np
pair_list = np.array([['samp1', 'samp4'],
['samp2', 'samp7'],
['samp2', 'samp4']])
samples = np.array(['samp0', 'samp1', 'samp2', 'samp3', 'samp4', 'samp5',
'samp6', 'samp7', 'samp8', 'samp9'])
vfunc = np.vectorize(lambda s: np.where(samples == s)[0])
pair_indices = vfunc(pair_list)
In [180]: print(pair_indices)
[[1 4]
[2 7]
[2 4]]

I suggest you to use dictionaries because of its performant time complexity.
>>> import numpy as np
>>> pair_list = np.array([['samp1', 'samp4'],
['samp2', 'samp7'],
['samp2', 'samp4']])
>>> samples = {'samp0':0, 'samp1':1, 'samp2':2, 'samp3':3, 'samp4':4, 'samp5':5,
'samp6':6, 'samp7':7, 'samp8':8, 'samp9':9}
>>> vfunc = np.vectorize(lambda x: samples[x])
>>> pair_indices = vfunc(pair_list)
>>> print(pair_indices)
[[1 4]
[2 7]
[2 4]]

pair_list = np.array([['samp1', 'samp4'],
['samp2', 'samp7'],
['samp2', 'samp4']])
samples = np.array(['samp0', 'samp1', 'samp2', 'samp3', 'samp4', 'samp5',
'samp6', 'samp7', 'samp8', 'samp9'])
def f1(pair_list,samples):
vfunc = np.vectorize(lambda s: np.where(samples == s)[0])
return vfunc(pair_list)
def f2(pair_list,samples):
d = dict()
for idx,el in enumerate(samples): d[el]=idx
return np.array([d[el] for row in pair_list for el in row]).reshape(pair_list.shape[0],2)
f2 looks clumsy, but...
timeit f1(pair_list,samples)
25.7 µs ± 78 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
timeit f2(pair_list,samples)
9.09 µs ± 68.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Try it on your machine and see how it goes for you! Of course, it'll be even better if you have the ability to reuse samples, since in that case you only have to convert samples to a dict once.
Edit: It's much, much better to vectorize dict access, as suggested by Mohsen_Fatemi, even if samples can't be reused.
def f3(pair_list,samples):
d = dict()
for idx,el in enumerate(samples): d[el]=idx
vfunc = np.vectorize(lambda x: d[x])
return vfunc(pair_list)
timeit f3
16.1 ns ± 0.0138 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

How to add two neighboring elements into a new one in numpy array [duplicate]

This question already has answers here:
How can I sum every n array values and place the result into a new array? [duplicate]
(3 answers)
Closed 3 years ago.
For example,
I have a numpy array containing:
[1, 2, 3, 4, 5, 6]
I want to create an array as follows:
[3, 7, 11]
That is, I want to add the two neighboring elements into a new one.
I have tried the obvious:
for i in range(0, predictions.shape[0]+1, 2):
new_pred = np.append(new_pred, (predictions[i] + predictions[i+1]) / 2)
print(predictions.shape)
(16000, 0)
print(new_pred.shape)
(87998, 0)
But the dimension of new_pred is not half of 16000.
So I am wondering is there anything wrong with my code? And is there a convenient way to implement it?

There are many different possibilities, here it is one, neither the slowest one nor the fastest, of them,
>>> import numpy as np
>>> a = np.arange(30)
>>> a.reshape(-1, 2).sum(axis=1)
array([ 1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57])
>>>
For the record (please note that we have a new fastest answer that, imho, can't be bettered at all)
In [17]: a = np.arange(10**5)
In [18]: %timeit a.reshape(-1,2).sum(axis=1)
1.08 ms ± 1.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [19]: %timeit [(a[i]+ a[i+1]) for i in range(0, len(a-1), 2)]
23.4 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [20]: %timeit [sum(item) for ind, item in enumerate(zip(a, a[1:])) if ind%2 == 0]
49.9 ms ± 313 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [21]: %timeit [sum(item) for item in zip(a[::2], a[1::2])]
30.2 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
...
In [23]: %timeit a[::2]+a[1::2]
78.9 µs ± 79.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Use slices of ndarray:
predictions[::2] + predictions[1::2]
It is 10 times faster than "reshape" solution
>>> a = np.arange(10**5)
>>> timeit(lambda: a.reshape(-1,2).sum(axis=-1), number=1000)
0.785971520585008
>>> timeit(lambda: a[::2]+a[1::2], number=1000)
0.07569492445327342

another pythonic Possibility would be to use list comprehensions:
something like this for the example you posted:
import numpy as np
a = np.arange(1, 7)
res = [(a[i]+ a[i+1]) for i in range(0, len(a-1), 2)]
print(res)
hope it helps

Using zip
zip_ls = zip(ls[::2], ls[1::2])
new_ls = [sum(item) for item in zip_ls]

how to get pandas series sorted position [duplicate]

I know this is a very basic question but for some reason I can't find an answer. How can I get the index of certain element of a Series in python pandas? (first occurrence would suffice)
I.e., I'd like something like:
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
print myseries.find(7) # should output 3
Certainly, it is possible to define such a method with a loop:
def find(s, el):
for i in s.index:
if s[i] == el:
return i
return None
print find(myseries, 7)
but I assume there should be a better way. Is there?

>>> myseries[myseries == 7]
3 7
dtype: int64
>>> myseries[myseries == 7].index[0]
3
Though I admit that there should be a better way to do that, but this at least avoids iterating and looping through the object and moves it to the C level.

Converting to an Index, you can use get_loc
In [1]: myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
In [3]: Index(myseries).get_loc(7)
Out[3]: 3
In [4]: Index(myseries).get_loc(10)
KeyError: 10
Duplicate handling
In [5]: Index([1,1,2,2,3,4]).get_loc(2)
Out[5]: slice(2, 4, None)
Will return a boolean array if non-contiguous returns
In [6]: Index([1,1,2,1,3,2,4]).get_loc(2)
Out[6]: array([False, False, True, False, False, True, False], dtype=bool)
Uses a hashtable internally, so fast
In [7]: s = Series(randint(0,10,10000))
In [9]: %timeit s[s == 5]
1000 loops, best of 3: 203 µs per loop
In [12]: i = Index(s)
In [13]: %timeit i.get_loc(5)
1000 loops, best of 3: 226 µs per loop
As Viktor points out, there is a one-time creation overhead to creating an index (its incurred when you actually DO something with the index, e.g. the is_unique)
In [2]: s = Series(randint(0,10,10000))
In [3]: %timeit Index(s)
100000 loops, best of 3: 9.6 µs per loop
In [4]: %timeit Index(s).is_unique
10000 loops, best of 3: 140 µs per loop

I'm impressed with all the answers here. This is not a new answer, just an attempt to summarize the timings of all these methods. I considered the case of a series with 25 elements and assumed the general case where the index could contain any values and you want the index value corresponding to the search value which is towards the end of the series.
Here are the speed tests on a 2012 Mac Mini in Python 3.9.10 with Pandas version 1.4.0.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: data = [406400, 203200, 101600, 76100, 50800, 25400, 19050, 12700, 950
...: 0, 6700, 4750, 3350, 2360, 1700, 1180, 850, 600, 425, 300, 212, 150, 1
...: 06, 75, 53, 38]
In [4]: myseries = pd.Series(data, index=range(1,26))
In [5]: assert(myseries[21] == 150)
In [6]: %timeit myseries[myseries == 150].index[0]
179 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: %timeit myseries[myseries == 150].first_valid_index()
205 µs ± 3.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: %timeit myseries.where(myseries == 150).first_valid_index()
597 µs ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: %timeit myseries.index[np.where(myseries == 150)[0][0]]
110 µs ± 872 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [10]: %timeit pd.Series(myseries.index, index=myseries)[150]
125 µs ± 2.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [11]: %timeit myseries.index[pd.Index(myseries).get_loc(150)]
49.5 µs ± 814 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [12]: %timeit myseries.index[list(myseries).index(150)]
7.75 µs ± 36.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: %timeit myseries.index[myseries.tolist().index(150)]
2.55 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [14]: %timeit dict(zip(myseries.values, myseries.index))[150]
9.89 µs ± 79.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [15]: %timeit {v: k for k, v in myseries.items()}[150]
9.99 µs ± 67 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#Jeff's answer seems to be the fastest - although it doesn't handle duplicates.
Correction: Sorry, I missed one, #Alex Spangher's solution using the list index method is by far the fastest.
Update: Added #EliadL's answer.
Hope this helps.
Amazing that such a simple operation requires such convoluted solutions and many are so slow. Over half a millisecond in some cases to find a value in a series of 25.
2022-02-18 Update
Updated all the timings with the latest Pandas version and Python 3.9. Even on an older computer, all the timings have significantly reduced (10 to 70%) compared to the previous tests (version 0.25.3).
Plus: Added two more methods utilizing dictionaries.

In [92]: (myseries==7).argmax()
Out[92]: 3
This works if you know 7 is there in advance. You can check this with
(myseries==7).any()
Another approach (very similar to the first answer) that also accounts for multiple 7's (or none) is
In [122]: myseries = pd.Series([1,7,0,7,5], index=['a','b','c','d','e'])
In [123]: list(myseries[myseries==7].index)
Out[123]: ['b', 'd']

Another way to do this, although equally unsatisfying is:
s = pd.Series([1,3,0,7,5],index=[0,1,2,3,4])
list(s).index(7)
returns:
3
On time tests using a current dataset I'm working with (consider it random):
[64]: %timeit pd.Index(article_reference_df.asset_id).get_loc('100000003003614')
10000 loops, best of 3: 60.1 µs per loop
In [66]: %timeit article_reference_df.asset_id[article_reference_df.asset_id == '100000003003614'].index[0]
1000 loops, best of 3: 255 µs per loop
In [65]: %timeit list(article_reference_df.asset_id).index('100000003003614')
100000 loops, best of 3: 14.5 µs per loop

If you use numpy, you can get an array of the indecies that your value is found:
import numpy as np
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
np.where(myseries == 7)
This returns a one element tuple containing an array of the indecies where 7 is the value in myseries:
(array([3], dtype=int64),)

you can use Series.idxmax()
>>> import pandas as pd
>>> myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
>>> myseries.idxmax()
3
>>>

This is the most native and scalable approach I could find:
>>> myindex = pd.Series(myseries.index, index=myseries)
>>> myindex[7]
3
>>> myindex[[7, 5, 7]]
7 3
5 4
7 3
dtype: int64

Another way to do it that hasn't been mentioned yet is the tolist method:
myseries.tolist().index(7)
should return the correct index, assuming the value exists in the Series.

Often your value occurs at multiple indices:
>>> myseries = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1])
>>> myseries.index[myseries == 1]
Int64Index([3, 4, 5, 6, 10, 11], dtype='int64')

The Pandas has builtin class Index with a function called get_loc. This function will either return
index (element index)
slice (if the specified number is in sequence)
array (bool array if the number is at multiple indexes)
Example:
import pandas as pd
>>> mySer = pd.Series([1, 3, 8, 10, 13])
>>> pd.Index(mySer).get_loc(10) # Returns index
3 # Index of 10 in series
>>> mySer = pd.Series([1, 3, 8, 10, 10, 10, 13])
>>> pd.Index(mySer).get_loc(10) # Returns slice
slice(3, 6, None) # 10 occurs at index 3 (included) to 6 (not included)
# If the data is not in sequence then it would return an array of bool's.
>>> mySer = pd.Series([1, 10, 3, 8, 10, 10, 10, 13, 10])
>>> pd.Index(mySer).get_loc(10)
array([False, True, False, False, True, True, False, True])
There are many other options too but I found it very simple for me.

df.index method will help you to find the exact row number
my_fl2=(df['ConvertedCompYearly'] == 45241312 )
print (df[my_fl2].index)
Name: ConvertedCompYearly, dtype: float64
Int64Index([66910], dtype='int64')

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Optimizing numpy concatenate operations for very large data sets - python

Related

Python: get maximum occurrence in array

How do I efficiently find which elements of a list are in another list?

How can I transpose an array of values into a matching array of indices using numpy vectorization?

How to add two neighboring elements into a new one in numpy array [duplicate]

how to get pandas series sorted position [duplicate]

Categories

Resources