It should not be so hard. I mean in C,
int a[10];
is all you need. How do I create an array of all zeros for an arbitrary size? I know the zeros() function in NumPy, but there must be an easy built-in way that doesn't require another module.
If you are not satisfied with lists (because they can contain anything and take up too much memory), you can use an efficient array of integers:
import array
array.array('i')
See the array module documentation for details.
If you need to initialize it,
a = array.array('i',(0 for i in range(0,10)))
two ways:
x = [0] * 10
x = [0 for i in xrange(10)]
Edit: replaced range by xrange to avoid creating another list.
Also, as many others have noted (including Pi and Ben James), this creates a list, not a Python array. While a list is in many cases sufficient and easy enough, for performance-critical uses (e.g. when duplicated in thousands of objects) you could look into Python arrays. Look up the array module, as explained in the other answers in this thread.
>>> a = [0] * 10
>>> a
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Use the array module. With it you can store collections of the same type efficiently.
>>> import array
>>> import itertools
>>> a = array_of_signed_ints = array.array("i", itertools.repeat(0, 10))
For more information (e.g. the different type codes), look at the documentation of the array module. For up to 1 million entries this should feel pretty snappy; for 10 million entries my local machine thinks for about 1.5 seconds.
The second parameter to array.array is a generator, which constructs the defined sequence as it is read. This way the array module can consume the zeros one by one, while the generator itself uses only constant memory and does not get bigger as the sequence gets longer. The array will grow, of course, but that should be obvious.
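A quick, hedged way to see that constant-memory point (the exact byte counts are CPython-specific):

import sys
import itertools

gen = itertools.repeat(0, 10000000)
lst = [0] * 10000000
print(sys.getsizeof(gen))  # a few dozen bytes, regardless of the count
print(sys.getsizeof(lst))  # tens of megabytes for the list object itself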
You use it just like a list:
>>> a.append(1)
>>> a.extend([1, 2, 3])
>>> a[-4:]
array('i', [1, 1, 2, 3])
>>> len(a)
14
...or simply convert it to a list:
>>> l = list(a)
>>> len(l)
14
Surprisingly
>>> a = [0] * 10000000
is faster at construction than the array method. Go figure! :)
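If you want to verify that claim on your own machine, a minimal timing sketch (the numbers will vary):

import timeit

list_time = timeit.timeit("[0] * 10000000", number=3)
array_time = timeit.timeit("array.array('i', (0 for _ in range(10000000)))",
                           setup="import array", number=3)
print(list_time, array_time)  # list construction is typically the faster of the two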
import numpy as np
new_array=np.linspace(0,10,11).astype('int')
This is an alternative that casts the type as the array is created.
a = 10 * [0]
gives you an array of length 10, filled with zeroes.
import random

def random_zeroes(max_size):
    "Create a list of zeros for a random size (up to max_size)."
    a = []
    for i in xrange(random.randrange(max_size)):
        a += [0]
    return a
Use range instead if you are using Python 3.x.
If you need to initialize an array quickly, you might do it by blocks instead of with a generator initializer, and it's going to be much faster. Creating a list with [0]*count is just as fast, still.
import array

def zerofill(arr, count):
    count *= arr.itemsize
    blocksize = 1024
    blocks, rest = divmod(count, blocksize)
    for _ in xrange(blocks):
        arr.fromstring("\x00"*blocksize)
    arr.fromstring("\x00"*rest)

def test_zerofill(count):
    iarr = array.array('i')
    zerofill(iarr, count)
    assert len(iarr) == count

def test_generator(count):
    iarr = array.array('i', (0 for _ in xrange(count)))
    assert len(iarr) == count

def test_list(count):
    L = [0]*count
    assert len(L) == count

if __name__ == '__main__':
    import timeit
    c = 100000
    n = 10
    print timeit.Timer("test(c)", "from __main__ import c, test_zerofill as test").repeat(number=n)
    print timeit.Timer("test(c)", "from __main__ import c, test_generator as test").repeat(number=n)
    print timeit.Timer("test(c)", "from __main__ import c, test_list as test").repeat(number=n)
Results:
(array in blocks) [0.022809982299804688, 0.014942169189453125, 0.014089107513427734]
(array with generator) [1.1884641647338867, 1.1728270053863525, 1.1622772216796875]
(list) [0.023866891860961914, 0.035660028457641602, 0.023386955261230469]
I am writing a function that pulls the top x values from a sparse vector (fewer values if there are less than x). I would like to include an "in-place" option like many functions have, where it removes the top values if the option is True, and keeps them if the option is False.
My issue is that my current function overwrites the input vector rather than keeping it as is, and I am not sure why. I expected to solve this by adding an if statement that copies the input using copy.copy(), but that raises ValueError: row index exceeds matrix dimensions, which does not make sense to me.
Code:
from scipy.sparse import csr_matrix
import copy
max_loc=20
data=[1,3,3,2,5]
rows=[0]*len(data)
indices=[4,2,8,12,7]
sparse_test=csr_matrix((data, (rows,indices)), shape=(1,max_loc))
print(sparse_test)
def top_x_in_sparse(in_vect,top_x,inplace=False):
    if inplace==True:
        rvect=in_vect
    else:
        rvect=copy.copy(in_vect)
    newmax=top_x
    count=0
    out_list=[]
    while newmax>0:
        newmax=1
        if count<top_x:
            out_list+=[csr_matrix.max(rvect)]
            remove=csr_matrix.argmax(rvect)
            rvect[0,remove]=0
            rvect.eliminate_zeros()
            newmax=csr_matrix.max(rvect)
            count+=1
        else:
            newmax=0
    return out_list
a=top_x_in_sparse(sparse_test,3)
print(a)
print(sparse_test)
My question has two parts:
how do I prevent this function from overwriting the vector?
how do I add the inplace option?
You really just don't want to loop here, period. Reallocating those arrays on every loop iteration with .eliminate_zeros() is the slowest part, but not the only reason not to do it.
import numpy as np
from scipy.sparse import csr_matrix
max_loc=20
data=[1,3,3,2,5]
rows=[0]*len(data)
indices=[4,2,8,12,7]
sparse_test=csr_matrix((data, (rows,indices)), shape=(1,max_loc))
Something like this would be better:
def top_x_in_sparse(in_vect,top_x,inplace=False):
    n = len(in_vect.data)
    if top_x >= n:
        if inplace:
            out_data = in_vect.data.tolist()
            in_vect.data = np.array([], dtype=in_vect.data.dtype)
            in_vect.indices = np.array([], dtype=in_vect.indices.dtype)
            in_vect.indptr = np.array([0, 0], dtype=in_vect.indptr.dtype)
            return out_data
        else:
            return in_vect.data.tolist()
    else:
        k = n - top_x
        partition_idx = np.argpartition(in_vect.data, k)
        if inplace:
            out_data = in_vect.data[partition_idx[k:n]].tolist()
            in_vect.data = in_vect.data[partition_idx[0:k]]
            in_vect.indices = in_vect.indices[partition_idx[0:k]]
            in_vect.indptr = np.array([0, len(in_vect.data)], dtype=in_vect.indptr.dtype)
            return out_data
        else:
            return in_vect.data[partition_idx[k:n]].tolist()
If you need the returned values sorted you can do that as well of course.
a=top_x_in_sparse(sparse_test,3,inplace=False)
>>> print(a)
[3, 5, 3]
>>> print(sparse_test)
(0, 2) 3
(0, 4) 1
(0, 7) 5
(0, 8) 3
(0, 12) 2
b=top_x_in_sparse(sparse_test,3,inplace=True)
>>> print(b)
[3, 5, 3]
>>> print(sparse_test)
(0, 4) 1
(0, 12) 2
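If you need the returned values sorted, a minimal sketch using plain sorted() on the result from above:

b_sorted = sorted(b, reverse=True)  # largest value first: [5, 3, 3]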
Also, as per your question from the comments: a shallow copy of the sparse array object will not copy the numpy arrays that hold the data; the sparse object only holds references to those objects. A deep copy would get them, but using the built-in method for copy already takes care of knowing which of the referenced things need to be copied and which don't.
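As a hedged illustration of that, continuing with sparse_test from above (scipy sparse matrices expose their own copy() method, which duplicates the underlying data/indices/indptr arrays):

independent = sparse_test.copy()   # new underlying arrays, not just a new wrapper
independent[0, 4] = 99             # modify an existing nonzero in the copy
print(sparse_test[0, 4])           # the original still holds its old value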
I want to create a random number made of n digits, each between i and j. For instance, for n=10, i=1 and j=5, an output like this is expected: 2414243211. I did it in R using this code:
paste(floor(runif(10,1,5)),collapse="") #runif create 10 random number between 1 and 5 and floor make them as integer and finally paste makes them as a sequence of numbers instead of array.
I want to do the same in Python. I found random.uniform but it generates 1 number and I don't want to use loops.
import random
import math
math.floor(random.uniform(1,5)) #just generate 1 number between 1 and 5
update:
i and j are integers between 0 and 9, while n could be any integer.
i and j decide which digits can be used in the string, while n indicates the length of the numeric string.
If I understand your question (not sure I do), and you have Python 3.6, you can use random.choices:
>>> from random import choices
>>> int(''.join(map(str, choices(range(1, 5), k=10))))
2121233233
The random.choices() function does what you want:
>>> from random import choices
>>> n, i, j = 10, 1, 5
>>> population = list(map(str, range(i, j+1)))
>>> ''.join(choices(population, k=n))
'5143113531'
If you consider list comprehensions to be loops (which in many ways they are), you will not be satisfied with this, but I will try my luck:
from random import randint
res = ''.join([str(randint(1, 5)) for _ in range(10)])
print(res) #-> 4353344154
Notes:
The result is a string! If you want an integer, cast to int.
randint works inclusively; both the start (1) and the end (5) can be produced and returned. If you do not want that, adjust them (start = 2 and end = 4).
Is there a reason you are using random.uniform (and subsequently math.floor()) and not simply randint?
x = ''.join([str(math.floor(random.uniform(1,5))) for i in range(10)])
After many attempts at optimizing the code, it seems that one last resort would be to run the code below on multiple cores. I don't know exactly how to convert/restructure my code so that it runs much faster using multiple cores. I would appreciate any guidance toward the end goal, which is to run this code as fast as possible for arrays A and B where each array holds about 700,000 elements. Here is the code using small arrays; the 700k-element arrays are commented out.
import numpy as np
def ismember(a,b):
    for i in a:
        index = np.where(b==i)[0]
        if index.size == 0:
            yield 0
        else:
            yield index

def f(A, gen_obj):
    my_array = np.arange(len(A))
    for i in my_array:
        my_array[i] = gen_obj.next()
    return my_array
#A = np.arange(700000)
#B = np.arange(700000)
A = np.array([3,4,4,3,6])
B = np.array([2,5,2,6,3])
gen_obj = ismember(A,B)
f(A, gen_obj)
print 'done'
# if we print f(A, gen_obj) the output will be: [4 0 0 4 3]
# notice that the output array needs to be kept the same size as array A.
What I am trying to do is mimic the MATLAB function ismember (the one with the signature [Lia,Locb] = ismember(A,B)). I am trying to get just the Locb part.
From MATLAB: Locb contains the lowest index in B for each value in A that is a member of B. The output array Locb contains 0 wherever A is not a member of B.
One of the main problems is that I need to be able to perform this operation as efficiently as possible. For testing I have two arrays of 700k elements. Creating a generator and going through its values doesn't seem to get the job done fast.
Before worrying about multiple cores, I would eliminate the linear scan in your ismember function by using a dictionary:
def ismember(a, b):
    bind = {}
    for i, elt in enumerate(b):
        if elt not in bind:
            bind[elt] = i
    return [bind.get(itm, None) for itm in a]  # None can be replaced by any other "not in b" value
Your original implementation requires a full scan of the elements in B for each element in A, making it O(len(A)*len(B)). The above code requires one full scan of B to generate the dict bind. By using a dict, you effectively make the lookup of each element in B constant for each element of A, making the operation O(len(A)+len(B)). If this is still too slow, then worry about making the above function run on multiple cores.
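A quick, hedged sanity check with the arrays from the question (None, rather than MATLAB's 0, marks values of A that are not in B):

A = [3, 4, 4, 3, 6]
B = [2, 5, 2, 6, 3]
print(ismember(A, B))  # [4, None, None, 4, 3]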
Edit: I've also modified your indexing slightly. MATLAB uses 0 because all of its arrays start at index 1. Python/numpy start arrays at 0, so if your data set looks like this
A = [2378, 2378, 2378, 2378]
B = [2378, 2379]
and you return 0 for no element, then your results will exclude all elements of A. The above routine returns None for no index instead of 0. Returning -1 is an option, but Python will interpret that to be the last element in the array, whereas None will raise an exception if it's used as an index into the array. If you'd like different behavior, change the second argument in the bind.get(itm, None) expression to the value you want returned.
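As a sketch of that tweak (ismember_sentinel is a hypothetical name; the only change from the function above is the default passed to .get()):

def ismember_sentinel(a, b, missing=-1):
    # identical to ismember() above, except values of a not found in b map to `missing`
    bind = {}
    for i, elt in enumerate(b):
        if elt not in bind:
            bind[elt] = i
    return [bind.get(itm, missing) for itm in a]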
sfstewman's excellent answer most likely solved the issue for you.
I'd just like to add how you can achieve the same exclusively in numpy.
I make use of numpy's unique and in1d functions.
B_unique_sorted, B_idx = np.unique(B, return_index=True)
B_in_A_bool = np.in1d(B_unique_sorted, A, assume_unique=True)
B_unique_sorted contains the unique values in B sorted.
B_idx holds for these values the indices into the original B.
B_in_A_bool is a boolean array the size of B_unique_sorted that
stores whether a value in B_unique_sorted is in A.
Note: I need to look for (unique vals from B) in A because I need the output to be returned with respect to B_idx
Note: I assume that A is already unique.
Now you can use B_in_A_bool to either get the common vals
B_unique_sorted[B_in_A_bool]
and their respective indices in the original B
B_idx[B_in_A_bool]
Finally, I assume that this is significantly faster than the pure Python for-loop although I didn't test it.
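A small, hedged end-to-end example of those pieces (note that A is kept unique here, per the assumption above):

import numpy as np

A = np.array([3, 6, 9])                # assumed unique
B = np.array([2, 5, 2, 6, 3])
B_unique_sorted, B_idx = np.unique(B, return_index=True)
B_in_A_bool = np.in1d(B_unique_sorted, A, assume_unique=True)
print(B_unique_sorted[B_in_A_bool])    # common values: [3 6]
print(B_idx[B_in_A_bool])              # their first indices in the original B: [4 3]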
Try the ismember library.
pip install ismember
Simple example:
# Import library
from ismember import ismember
import numpy as np
# data
A = np.array([3,4,4,3,6])
B = np.array([2,5,2,6,3])
# Lookup
Iloc,idx = ismember(A, B)
# Iloc is a boolean array marking which elements of A occur in B
print(Iloc)
# [ True False False True True]
# indices into B for each matching element of A
print(idx)
# [4 4 3]
print(B[idx])
# [3 3 6]
print(A[Iloc])
# [3 3 6]
# These vectors will match
A[Iloc]==B[idx]
Speed check:
from ismember import ismember
from datetime import datetime
import numpy as np

t1=[]
t2=[]

# Create some random vectors
ns = np.random.randint(10,10000,1000)

for n in ns:
    a_vec = np.random.randint(0,100,n)
    b_vec = np.random.randint(0,100,n)

    # Run stack version
    start = datetime.now()
    out1=ismember_stack(a_vec, b_vec)
    end = datetime.now()
    t1.append(end - start)

    # Run ismember
    start = datetime.now()
    out2=ismember(a_vec, b_vec)
    end = datetime.now()
    t2.append(end - start)
print(np.sum(t1))
# 0:00:07.778331
print(np.sum(t2))
# 0:00:04.609801
# %%
def ismember_stack(a, b):
    bind = {}
    for i, elt in enumerate(b):
        if elt not in bind:
            bind[elt] = i
    return [bind.get(itm, None) for itm in a]  # None can be replaced by any other "not in b" value
The ismember function from pypi is almost 2x faster.
Large vectors, eg 700000 elements:
from ismember import ismember
from datetime import datetime
import numpy as np
A = np.random.randint(0,100,700000)
B = np.random.randint(0,100,700000)
# Lookup
start = datetime.now()
Iloc,idx = ismember(A, B)
end = datetime.now()
# Print time
print(end-start)
# 0:00:01.194801
Try using a list comprehension;
In [1]: import numpy as np
In [2]: A = np.array([3,4,4,3,6])
In [3]: B = np.array([2,5,2,6,3])
In [4]: [x for x in A if not x in B]
Out[4]: [4, 4]
Generally, list comprehensions are faster than equivalent explicit for-loops.
To get an equal length-list;
In [19]: map(lambda x: x if x not in B else False, A)
Out[19]: [False, 4, 4, False, False]
This is quite fast for small datasets:
In [20]: C = np.arange(10000)
In [21]: D = np.arange(15000, 25000)
In [22]: %timeit map(lambda x: x if x not in D else False, C)
1 loops, best of 3: 756 ms per loop
For large datasets, you could try using a multiprocessing.Pool.map() to speed up the operation.
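A hedged sketch of that idea (the callable must be a top-level function rather than a lambda so it can be pickled; converting D to a set is my addition to make each membership test O(1)):

from multiprocessing import Pool
import numpy as np

C = np.arange(10000)
D = set(np.arange(15000, 25000).tolist())  # set membership is O(1); scanning an array would be O(n)

def locate(x):
    # top-level function so multiprocessing can pickle it
    return x if x not in D else False

if __name__ == '__main__':
    pool = Pool()
    result = pool.map(locate, C)  # same result as the serial map above, computed in parallel
    pool.close()
    pool.join()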
Here is an exact MATLAB equivalent that returns both output arguments [Lia, Locb] and matches MATLAB, except that in Python 0 is also a valid index, so this function doesn't return the 0s; it essentially returns Locb(Locb>0). The performance is also equivalent to MATLAB's.
import numpy as np

def ismember(a_vec, b_vec):
    """ MATLAB equivalent ismember function """
    bool_ind = np.isin(a_vec, b_vec)
    common = a_vec[bool_ind]
    common_unique, common_inv = np.unique(common, return_inverse=True)  # common = common_unique[common_inv]
    b_unique, b_ind = np.unique(b_vec, return_index=True)  # b_unique = b_vec[b_ind]
    common_ind = b_ind[np.isin(b_unique, common_unique, assume_unique=True)]
    return bool_ind, common_ind[common_inv]
An alternate implementation that is a bit (~5x) slower but doesn't use the unique function is here:
def ismember(a_vec, b_vec):
    ''' MATLAB equivalent ismember function. Slower than the implementation above. '''
    b_dict = {b_vec[i]: i for i in range(0, len(b_vec))}
    indices = [b_dict.get(x) for x in a_vec if b_dict.get(x) is not None]
    booleans = np.in1d(a_vec, b_vec)
    return booleans, np.array(indices, dtype=int)
When appending the results of longer expressions to a list, I feel append becomes awkward to read.
Example:
import math
mylist = list()
phi = [1,2,3,4] # lets pretend this is of unknown/varying lengths
i, num, radius = 0, 4, 6
while i < num:
    mylist.append(2*math.pi*radius*math.cos(phi[i]))
    i = i + 1
Though append works just fine, I feel it is less clear than:
mylist[i] = 2*math.pi*radius*math.cos(phi[i])
But this does not work, as that element does not exist in the list yet, yielding:
IndexError: list assignment index out of range
I could just assign the resulting value to a temporary variable and append that, but that seems ugly and inefficient.
You don't need to create a list first and append to it later. Just use a list comprehension.
A list comprehension
is fast,
is easy to comprehend,
and can easily be ported to a generator expression (a sketch follows the code below).
>>> import math
>>> phi = [1,2,3,4]
>>> i, num, radius = 0, 4, 6
>>> circum = 2*math.pi*radius
>>> mylist = [circum * math.cos(p) for p in phi]
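The generator-expression variant mentioned above, as a minimal sketch (same session; values are produced lazily instead of stored up front):

>>> gen = (circum * math.cos(p) for p in phi)  # nothing is computed yet
>>> mylist = list(gen)                         # values are produced on demand here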
Reviewing your code, here are some generic suggestions
Do not compute a known constant in an iteration
while i < num:
    mylist.append(2*math.pi*radius*math.cos(phi[i]))
    i = i + 1
should be written as
circum = 2*math.pi*radius
while i < num:
    mylist.append(circum*math.cos(phi[i]))
    i = i + 1
Instead of while use for-each construct
for p in phi:
    mylist.append(circum*math.cos(p))
If an expression is not readable, break it into multiple statements; after all, readability counts in Python.
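A tiny sketch of that last suggestion, reusing phi and radius from the question (the intermediate names are hypothetical):

circum = 2*math.pi*radius
mylist = []
for p in phi:
    contribution = circum*math.cos(p)  # the named intermediate makes the intent explicit
    mylist.append(contribution)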
In this particular case you could use a list comprehension:
mylist = [2*math.pi*radius*math.cos(phi[i]) for i in range(num)]
Or, if you're doing this sort of computations a lot, you could move away from using lists and use NumPy instead:
In [78]: import numpy as np
In [79]: phi = np.array([1, 2, 3, 4])
In [80]: radius = 6
In [81]: 2 * np.pi * radius * np.cos(phi)
Out[81]: array([ 20.36891706, -15.68836613, -37.32183785, -24.64178397])
I find this last version to be the most aesthetically pleasing of all. For longer phi it will also be more performant than using lists.
mylist += [2*math.pi*radius*math.cos(phi[i])]
You can use list concatenation, but append is twice as fast according to this:
import math
mylist = list()
phi = [1,2,3,4] # lets pretend this is of unknown/varying lengths
i, num, radius = 0, 4, 6
while i < num:
    mylist += [(2*math.pi*radius*math.cos(phi[i]))]
    i = i + 1
Given two sorted arrays like the following:
a = array([1,2,4,5,6,8,9])
b = array([3,4,7,10])
I would like the output to be:
c = array([1,2,3,4,5,6,7,8,9,10])
or:
c = array([1,2,3,4,4,5,6,7,8,9,10])
I'm aware that I can do the following:
c = unique(concatenate((a,b)))
I'm just wondering if there is a faster way to do it as the arrays I'm dealing with have millions of elements.
Any idea is welcome. Thanks.
Since you use numpy, I doubt that bisect helps you at all... So instead I would suggest two smaller things:
Do not use np.sort, use c.sort() method instead which sorts the array in place and avoids the copy.
np.unique must use np.sort, which is not in place. So instead of using np.unique, do the logic by hand: i.e. first sort (in place), then do the np.unique step by hand (check also its Python code), with flag = np.concatenate(([True], ar[1:] != ar[:-1])) and unique = ar[flag] (with ar being sorted). To be a bit better, you should probably make the flag operation in place itself, i.e. flag = np.ones(len(ar), dtype=bool) and then np.not_equal(ar[1:], ar[:-1], out=flag[1:]), which basically avoids one full copy of flag.
I am not sure about this, but .sort has 3 different algorithms; since your arrays may already be almost sorted, changing the sorting method might make a speed difference.
This would make the full thing close to what you got (without doing a unique beforehand):
def insort(a, b, kind='mergesort'):
    # took mergesort as it seemed a tiny bit faster for my sorted large array try.
    c = np.concatenate((a, b))  # we still need to do this unfortunately.
    c.sort(kind=kind)
    flag = np.ones(len(c), dtype=bool)
    np.not_equal(c[1:], c[:-1], out=flag[1:])
    return c[flag]
Inserting elements into the middle of an array is a very inefficient operation as they're flat in memory, so you'll need to shift everything along whenever you insert another element. As a result, you probably don't want to use bisect. The complexity of doing so would be around O(N^2).
Your current approach is O(n*log(n)), so that's a lot better, but it's not perfect.
Inserting all the elements into a hash table (such as a set) is an option. That's going to take O(N) time to uniquify, but then you need to sort, which will take O(n*log(n)). Still not great.
The real O(N) solution involves allocating an array and then populating it one element at a time by taking the smallest head of your input lists, i.e. a merge. Unfortunately, neither numpy nor Python seems to have such a thing. The solution may be to write one in Cython.
It would look vaguely like the following:
def foo(numpy.ndarray[int, ndim=1] out,
        numpy.ndarray[int, ndim=1] in1,
        numpy.ndarray[int, ndim=1] in2):
    cdef int i = 0
    cdef int j = 0
    cdef int k = 0
    while (i != len(in1)) or (j != len(in2)):
        # set out[k] to the smaller of in1[i] or in2[j],
        # increment k, and increment one of i or j
        if j == len(in2) or (i != len(in1) and in1[i] <= in2[j]):
            out[k] = in1[i]
            i += 1
        else:
            out[k] = in2[j]
            j += 1
        k += 1
When curious about timings, it's always best to just use timeit. Below, I've listed a subset of the various methods and their timings:
import numpy as np
import timeit
import heapq

def insort(a, x, lo=0, hi=None):
    if hi is None: hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if x < a[mid]: hi = mid
        else: lo = mid+1
    return lo, np.insert(a, lo, [x])

size=10000
a = np.array(range(size))
b = np.array(range(size))

def op(a,b):
    return np.unique(np.concatenate((a,b)))

def martijn(a,b):
    c = np.copy(a)
    lo = 0
    for i in b:
        lo, c = insort(c, i, lo)
    return c

def martijn2(a,b):
    c = np.zeros(len(a) + len(b), a.dtype)
    for i, v in enumerate(heapq.merge(a, b)):
        c[i] = v

def larsmans(a,b):
    return np.array(sorted(set(a) | set(b)))

def larsmans_mod(a,b):
    return np.array(set.union(set(a),b))

def sebastian(a, b, kind='mergesort'):
    # took mergesort as it seemed a tiny bit faster for my sorted large array try.
    c = np.concatenate((a, b))  # we still need to do this unfortunately.
    c.sort(kind=kind)
    flag = np.ones(len(c), dtype=bool)
    np.not_equal(c[1:], c[:-1], out=flag[1:])
    return c[flag]
Results:
martijn2 25.1079499722
OP 1.44831800461
larsmans 9.91507601738
larsmans_mod 5.87612199783
sebastian 3.50475311279e-05
My specific contribution here is larsmans_mod which avoids creating 2 sets -- it only creates 1 and in doing so cuts execution time nearly in half.
EDIT: removed martijn from the timings as it was too slow to compete. The tests used slightly bigger (sorted) input arrays. I also have not tested the outputs for correctness ...
In addition to the other answer on using bisect.insort, if you are not content with performance, you may try using the blist module with bisect. It should improve the performance.
Traditional list insertion complexity is O(n), while blist's complexity on insertion is O(log(n)).
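A minimal sketch of that idea (assuming the third-party blist package is installed; its blist type is a drop-in list replacement, so bisect.insort works on it directly):

import bisect
from blist import blist  # third-party: pip install blist

a = blist([1, 2, 4, 5, 6, 8, 9])
for x in [3, 4, 7, 10]:
    bisect.insort(a, x)  # insertion into a blist is O(log n) rather than O(n)
print(list(a))  # [1, 2, 3, 4, 4, 5, 6, 7, 8, 9, 10]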
Also, your arrays seem to be sorted. If so, you can use the merge function from the heapq module to utilize the fact that both arrays are presorted. This approach has some overhead because it creates a new array in memory. It may be an option to consider, as this solution's time complexity is O(n+m), while the solutions with insort have O(n*m) complexity (n elements * m insertions).
import heapq
a = [1,2,4,5,6,8,9]
b = [3,4,7,10]
it = heapq.merge(a,b) #iterator consisting of merged elements of a and b
L = list(it) #list made of it
print(L)
Output:
[1, 2, 3, 4, 4, 5, 6, 7, 8, 9, 10]
If you want to delete repeating values, you can use groupby:
import heapq
import itertools
a = [1,2,4,5,6,8,9]
b = [3,4,7,10]
it = heapq.merge(a,b) #iterator consisting of merged elements of a and b
it = (k for k,v in itertools.groupby(it))
L = list(it) #list made of it
print(L)
Output:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
You could use the bisect module for such merges, merging the second Python list into the first.
The bisect* functions work for numpy arrays but the insort* functions don't. It's easy enough to use the module source code to adapt the algorithm; it's quite basic:
from numpy import array, copy, insert

def insort(a, x, lo=0, hi=None):
    if hi is None: hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if x < a[mid]: hi = mid
        else: lo = mid+1
    return lo, insert(a, lo, [x])
a = array([1,2,4,5,6,8,9])
b = array([3,4,7,10])
c = copy(a)
lo = 0
for i in b:
    lo, c = insort(c, i, lo)
Not that the custom insort is really adding anything here; the default bisect.bisect works just fine too:
import bisect
c = copy(a)
lo = 0
for i in b:
    lo = bisect.bisect(c, i)
    c = insert(c, lo, i)
Using this adapted insort is much more efficient than a combine and sort. Because b is sorted as well, we can track the lo insertion point and search for the next point starting there instead of considering the whole array each loop.
If you don't need to preserve a, just operate directly on that array and save yourself the copy.
More efficient still: because both lists are sorted, we can use heapq.merge:
from numpy import zeros
import heapq
c = zeros(len(a) + len(b), a.dtype)
for i, v in enumerate(heapq.merge(a, b)):
    c[i] = v
Use the bisect module for this:
import bisect
from numpy import array, insert

a = array([1,2,4,5,6,8,9])
b = array([3,4,7,10])

for i in b:
    pos = bisect.bisect(a, i)
    a = insert(a, pos, i)  # np.insert returns a new array, so reassign it
I can't test this right now, but it should work
The sortednp package implements an efficient merge of sorted numpy-arrays, just sorting the values, not making them unique:
import numpy as np
import sortednp
a = np.array([1,2,4,5,6,8,9])
b = np.array([3,4,7,10])
c = sortednp.merge(a, b)
I measured the times and compared them in this answer to a similar post where it outperforms numpy's mergesort (v1.17.4).
Seems like no one mentioned np.union1d. Currently it is a shortcut for unique(concatenate((ar1, ar2))), but it's a short name to remember and it has the potential to be optimized by the numpy developers since it's a library function. It performs very similarly to insort from seberg's accepted answer for large arrays. Here is my benchmark:
import numpy as np
def insort(a, b, kind='mergesort'):
    # took mergesort as it seemed a tiny bit faster for my sorted large array try.
    c = np.concatenate((a, b))  # we still need to do this unfortunately.
    c.sort(kind=kind)
    flag = np.ones(len(c), dtype=bool)
    np.not_equal(c[1:], c[:-1], out=flag[1:])
    return c[flag]
size = int(1e7)
a = np.random.randint(np.iinfo(np.int).min, np.iinfo(np.int).max, size)
b = np.random.randint(np.iinfo(np.int).min, np.iinfo(np.int).max, size)
np.testing.assert_array_equal(insort(a, b), np.union1d(a, b))
import timeit
repetitions = 20
print("insort: %.5fs" % (timeit.timeit("insort(a, b)", "from __main__ import a, b, insort", number=repetitions)/repetitions,))
print("union1d: %.5fs" % (timeit.timeit("np.union1d(a, b)", "from __main__ import a, b; import numpy as np", number=repetitions)/repetitions,))
Output on my machine:
insort: 1.69962s
union1d: 1.66338s