Speed up comparison of floats between lists - python

I have a block of code which does the following:
take a float from a list, b_lst below, of index indx
check if this float is located between a float of index i and the next one (of index i+1) in list a_lst
if it is, then store indx in a sub-list of a third list (c_lst) where the index of that sub-list is the index of the left float in a_lst (ie: i)
repeat for all floats in b_lst
Here's a MWE which shows what the code does:
import numpy as np
import timeit
def random_data(N):
# Generate some random data.
return np.random.uniform(0., 10., N).tolist()
# Data lists.
# Note that a_lst is sorted.
a_lst = np.sort(random_data(1000))
b_lst = random_data(5000)
# Fixed index value (int)
c = 25
def func():
# Create empty list with as many sub-lists as elements present
# in a_lst beyond the 'c' index.
c_lst = [[] for _ in range(len(a_lst[c:])-1)]
# For each element in b_lst.
for indx,elem in enumerate(b_lst):
# For elements in a_lst beyond the 'c' index.
for i in range(len(a_lst[c:])-1):
# Check if 'elem' is between this a_lst element
# and the next.
if a_lst[c+i] < elem <= a_lst[c+(i+1)]:
# If it is then store the index of 'elem' ('indx')
# in the 'i' sub-list of c_lst.
c_lst[i].append(indx)
return c_lst
print func()
# time function.
func_time = timeit.timeit(func, number=10)
print func_time
This code works as it should but I really need to improve its performance since it's slowing down the rest of my code.
Add
This is the optimized function based on the accepted answer. It's quite ugly but it gets the job done.
def func_opt():
c_lst = [[] for _ in range(len(a_lst[c:])-1)]
c_opt = np.searchsorted(a_lst[c:], b_lst, side='left')
for elem in c_opt:
if 0<elem<len(a_lst[c:]):
c_lst[elem-1] = np.where(c_opt==elem)[0].tolist()
return c_lst
In my tests this is ~7x faster than the original function.
Add 2
Much faster not using np.where:
def func_opt2():
c_lst = [[] for _ in range(len(a_lst[c:])-1)]
c_opt = np.searchsorted(a_lst[c:], b_lst, side='left')
for indx,elem in enumerate(c_opt):
if 0<elem<len(a_lst[c:]):
c_lst[elem-1].append(indx)
return c_lst
This is ~130x faster than the original function.
Add 3
Following jtaylor's advice I converted the result of np.searchsorted to a list with .tolist():
def func_opt3():
c_lst = [[] for _ in range(len(a_lst[c:])-1)]
c_opt = np.searchsorted(a_lst[c:], b_lst, side='left').tolist()
for indx,elem in enumerate(c_opt):
if 0<elem<len(a_lst[c:]):
c_lst[elem-1].append(indx)
return c_lst
This is ~470x faster than the original function.

You want to take a look at numpy's searchsorted. Calling
np.searchsorted(a_lst, b_lst, side='right')
will return an array of indices, the same length as b_lst, holding before which item in a_lst they should be inserted to preserve order. It will be very fast, as it uses binary search and the looping happens in C. You could then create your subarrays with fancy indexing, e.g.:
>>> a = np.arange(1, 10)
>>> b = np.random.rand(100) * 10
>>> c = np.searchsorted(a, b, side='right')
>>> b[c == 0]
array([ 0.54620226, 0.40043875, 0.62398925, 0.40097674, 0.58765603,
0.14045264, 0.16990249, 0.78264088, 0.51507254, 0.31808327,
0.03895417, 0.92130027])
>>> b[c == 1]
array([ 1.34599709, 1.42645778, 1.13025996, 1.20096723, 1.75724448,
1.87447058, 1.23422399, 1.37807553, 1.64118058, 1.53740299])

Related

Realign indexes to a changed python collection

I have a collection of data and a variable containing indexes to some of them.
A filtering operation is applied on the data that eliminates a subset of the data.
I want to shift the indexes so that they refer to the updated collection of data (eliminating indexes to deleted instances).
I'm using the implementation in the function below. I'm also posting the code I used to validate that it works.
Is there a quick & fast way to do the index realignment via the core libraries or a better way in general?
import random
def align_index(wanted_idx, mask):
"""
Function to align a set of indexes to a collection after deletions,
indicated with a mask
Arguments:
wanted_idx: List of desired integer indexes prior to deletion
mask: Binary mask, where 1's indicate elements that survive deletion
Returns:
List of integer indexes to (surviving) desired elements, post-deletion
"""
# rebuild indexes: remove dangling
new_idx = [idx for (i, idx) in enumerate(wanted_idx) if mask[idx]]
# mark deleted
not_mask = [int(not m) for m in mask]
# cumsum deleted regions
realigned_idx = [k-sum(not_mask[:k+1]) for k in new_idx]
return realigned_idx
# data
data = [random.randint(0,500) for _ in range(1000)]
rng = list(range(len(data)))
for _ in range(1000):
# random data deletion / request
wanted_idx = random.sample(rng, random.randint(5,100))
del_index = random.sample(rng, random.randint(5, 100))
# apply deletion
mask = [int(i not in del_index) for i in range(len(data))]
filtered_data = [data[i] for (i, m) in enumerate(mask) if m]
realigned_index = align_index(wanted_idx, mask)
# verify
new_idx = [idx for (i, idx) in enumerate(wanted_idx) if mask[idx]]
l1 = [data[k] for k in new_idx]
l2 = [filtered_data[k] for k in realigned_index]
assert l1 == l2
If you use numpy it's quite trivial:
import numpy as np
mask = np.array(mask, dtype=np.bool)
new_idx = np.cumsum(mask, dtype=np.int64)
new_idx[mask] = -1
You shouldn't need to recompute new_idx unless more elements get deleted.
Then you can get the remapped index for old index i just by looking new_idx[i]. Or a whole array at once:
wanted_idx = np.array(wanted_idx, dtype=np.int64)
remapped_idx = new_idx[wanted_idx]
Note that deleted indices get assigned value -1. You can filter these out if you want:
remapped_idx = remapped_idx[remapped_idx >= 0]

How to find steps in a vector (1d array, list) in Python?

I want to get border of data in a list using python
For example I have this list :
a = [1,1,1,1,4,4,4,6,6,6,6,6,1,1,1]
I want a code that return data borders. for example:
a = [1,1,1,1,4,4,4,6,6,6,6,6,1,1,1]
^ ^ ^ ^
b = get_border_index(a)
print(b)
output:
[0,4,7,12]
How can I implement get_border_index(lst: list) -> list function?
The scalable answer that also works for very long lists or arrays is to use np.diff. In that case you should avoid a for loop at all costs.
import numpy as np
a = [1,1,1,1,4,4,4,6,6,6,6,6,1,1,1]
a = np.array(a)
# this is unequal 0 if there is a step
d = np.diff(a)
# boolean array where the steps are
is_step = d != 0
# get the indices of the steps (first one is trivial).
ics = np.where(is_step)
# get the first dimension and shift by one as you want
# the index of the element right of the step
ics_shift = ics[0] + 1
# and if you need a list
ics_list = ics_shift.tolist()
print(ics_list)
You can use for loop with enumerate
def get_border_index(a):
last_value = None
result = []
for i, v in enumerate(a):
if v != last_value:
last_value = v
result.append(i)
return result
a = [1,1,1,1,4,4,4,6,6,6,6,6,1,1,1]
b = get_border_index(a)
print(b)
Output
[0, 4, 7, 12]
This code will check if an element in the a list is different then the element before and if so it will append the index of the element to the result list.

Keep first element of array that is part of a sequence

Say I have this type of array
y
array([299839, 667136, 665420, 665418, 665421, 667135, 299799, 665419, 667137, 299800])
as the result of a "top 10" argpartition:
y = np.argpartiton(-x, np.arange(10))[:10]
Now, I want to remove the elements that are sequential, only keeping the first (maximum) element in the series such that:
y_new
array([299839, 667136, 665420, 299799])
But while that seems like it should be simple I'm not seeing an efficient way to do it (or even a good way to start). Assume the real-world application will do the top 1000 or so and need to do it many times.
Here's one approach based on sorting -
# Get the sorted indices
sidx = y.argsort()
# Get sorted array
ys = y[sidx]
# Get indices at which islands of sequential numbers start/stop
cut_idx = np.flatnonzero(np.concatenate(([True], np.diff(ys)!=1 )))
# Finally get the minimum indices for each island and then index into
# input for the desired output
y_new = y[np.minimum.reduceat(sidx, cut_idx)]
If you would like to keep the order of elements in the output, sort the indices and then index at the last step -
y[np.sort(np.minimum.reduceat(sidx, cut_idx))]
Sample input, output -
In [56]: y
Out[56]:
array([299839, 667136, 665420, 665418, 665421, 667135, 299799, 665419,
667137, 299800])
In [57]: y_new
Out[57]: array([299799, 299839, 665420, 667136])
In [58]: y[np.sort(np.minimum.reduceat(sidx, cut_idx))]
Out[58]: array([299839, 667136, 665420, 299799])
heres my implementation for that problem
from itertools import groupby
from operator import itemgetter
a = [299839, 667136, 665420, 665418, 665421, 667135, 299799, 665419,
667137, 299800]
new = a[:]
# to keep the first number
b = a[0]
new.sort()
# to store diffrent arrays
saver = []
final_array = []
for k, g in groupby(enumerate(new), lambda (i, x): i - x):
ac = map(itemgetter(1), g)
saver.append(ac)
final_array.append(b)
for i in range(len(saver)):
for j in range(len(a)):
if a[j] in saver[i]:
if b == a[j]:
continue
final_array.append(a[j])
break
print final_array
output
[299839, 299799, 665420, 667136]

Creating 'n' combinations randomly from multiple lists

def models():
default = [0.6,0.67,2.4e-2,1e-2,2e-5,1.2e-3,2e-5]
lower = [np.log10(i/10) for i in default]
upper = [np.log10(i*10) for i in default]
n = 5
a = np.logspace(lower[0],upper[0],n)
b = np.logspace(lower[1],upper[1],n)
c = np.logspace(lower[2],upper[2],n)
d = np.logspace(lower[3],upper[3],n)
e = np.logspace(lower[4],upper[4],n)
f = np.logspace(lower[5],upper[5],n)
g = np.logspace(lower[6],upper[6],n)
combs = itertools.product(a,b,c,d,e,f,g)
list1 = []
for x in combs:
x = list(x)
list1.append(x)
return list1
The code above returns a list of 5^7 = 78,125 lists. Is there a way I can combine items in a,b,c,d,e,f,g, possibly randomly, to create a list of say, 10000, lists?
You could take random samples of each array and combine them, especially if you don't need to guarantee that specific combinations don't occur more than once:
import numpy as np
import random
def random_models(num_values):
n = 5
default = [0.6, 0.67, 2.4e-2, 1e-2, 2e-5, 1.2e-3, 2e-5]
ranges = zip((np.log10(i/10) for i in default),
(np.log10(i*10) for i in default))
data_arrays = []
for lower, upper in ranges:
data_arrays.append(np.logspace(lower, upper, n))
results = []
for i in xrange(num_values):
results.append([random.choice(arr) for arr in data_arrays])
return results
l = random_models(10000)
print len(l)
Here's a version that will avoid repeats up until you request more data than can be given without repeating:
def random_models_avoid_repeats(num_values):
n = 5
default = [0.6, 0.67, 2.4e-2, 1e-2, 2e-5, 1.2e-3, 2e-5]
# Build the range data (tuples of (lower, upper) range)
ranges = zip((np.log10(i/10) for i in default),
(np.log10(i*10) for i in default))
# Create the data arrays to sample from
data_arrays = []
for lower, upper in ranges:
data_arrays.append(np.logspace(lower, upper, n))
sequence_data = []
for entry in itertools.product(*data_arrays):
sequence_data.append(entry)
results = []
# Holds the current choices to choose from. The data will come from
# sequence_data above, but randomly shuffled. Values are popped off the
# end to keep things efficient. It's possible to ask for more data than
# the samples can give without repeats. In that case, we'll reload
# temp_data, randomly shuffle again, and start the process over until we've
# delivered the number of desired results.
temp_data = []
# Build the lists
for i in xrange(num_values):
if len(temp_data) == 0:
temp_data = sequence_data[:]
random.shuffle(temp_data)
results.append(temp_data.pop())
return results
Also note that we can avoid building a results list if you make this a generator by using yield. However, you'd want to consume the results using a forstatement as well:
def random_models_avoid_repeats_generator(num_values):
n = 5
default = [0.6, 0.67, 2.4e-2, 1e-2, 2e-5, 1.2e-3, 2e-5]
# Build the range data (tuples of (lower, upper) range)
ranges = zip((np.log10(i/10) for i in default),
(np.log10(i*10) for i in default))
# Create the data arrays to sample from
data_arrays = []
for lower, upper in ranges:
data_arrays.append(np.logspace(lower, upper, n))
sequence_data = []
for entry in itertools.product(*data_arrays):
sequence_data.append(entry)
# Holds the current choices to choose from. The data will come from
# sequence_data above, but randomly shuffled. Values are popped off the
# end to keep things efficient. It's possible to ask for more data than
# the samples can give without repeats. In that case, we'll reload
# temp_data, randomly shuffle again, and start the process over until we've
# delivered the number of desired results.
temp_data = []
# Build the lists
for i in xrange(num_values):
if len(temp_data) == 0:
temp_data = sequence_data[:]
random.shuffle(temp_data)
yield temp_data.pop()
You'd have to use it like this:
for entry in random_models_avoid_repeats_generator(10000):
# Do stuff...
Or manually iterate over it using next().

Fast way to remove a few items from a list/queue

This is a follow up to a similar question which asked the best way to write
for item in somelist:
if determine(item):
code_to_remove_item
and it seems the consensus was on something like
somelist[:] = [x for x in somelist if not determine(x)]
However, I think if you are only removing a few items, most of the items are being copied into the same object, and perhaps that is slow. In an answer to another related question, someone suggests:
for item in reversed(somelist):
if determine(item):
somelist.remove(item)
However, here the list.remove will search for the item, which is O(N) in the length of the list. May be we are limited in that the list is represented as an array, rather than a linked list, so removing items will need to move everything after it. However, it is suggested here that collections.dequeue is represented as a doubly linked list. It should then be possible to remove in O(1) while iterating. How would we actually accomplish this?
Update:
I did some time testing as well, with the following code:
import timeit
setup = """
import random
random.seed(1)
b = [(random.random(),random.random()) for i in xrange(1000)]
c = []
def tokeep(x):
return (x[1]>.45) and (x[1]<.5)
"""
listcomp = """
c[:] = [x for x in b if tokeep(x)]
"""
filt = """
c = filter(tokeep, b)
"""
print "list comp = ", timeit.timeit(listcomp,setup, number = 10000)
print "filtering = ", timeit.timeit(filt,setup, number = 10000)
and got:
list comp = 4.01255393028
filtering = 3.59962391853
The list comprehension is the asymptotically optimal solution:
somelist = [x for x in somelist if not determine(x)]
It only makes one pass over the list, so runs in O(n) time. Since you need to call determine() on each object, any algorithm will require at least O(n) operations. The list comprehension does have to do some copying, but it's only copying references to the objects not copying the objects themselves.
Removing items from a list in Python is O(n), so anything with a remove, pop, or del inside the loop will be O(n**2).
Also, in CPython list comprehensions are faster than for loops.
If you need to remove item in O(1) you can use HashMaps
Since list.remove is equivalent to del list[list.index(x)], you could do:
for idx, item in enumerate(somelist):
if determine(item):
del somelist[idx]
But: you should not modify the list while iterating over it. It will bite you, sooner or later. Use filter or list comprehension first, and optimise later.
A deque is optimized for head and tail removal, not for arbitrary removal in the middle. The removal itself is fast, but you still have to traverse the list to the removal point. If you're iterating through the entire length, then the only difference between filtering a deque and filtering a list (using filter or a comprehension) is the overhead of copying, which at worst is a constant multiple; it's still a O(n) operation. Also, note that the objects in the list aren't being copied -- just the references to them. So it's not that much overhead.
It's possible that you could avoid copying like so, but I have no particular reason to believe this is faster than a straightforward list comprehension -- it's probably not:
write_i = 0
for read_i in range(len(L)):
L[write_i] = L[read_i]
if L[read_i] not in ['a', 'c']:
write_i += 1
del L[write_i:]
I took a stab at this. My solution is slower, but requires less memory overhead (i.e. doesn't create a new array). It might even be faster in some circumstances!
This code has been edited since its first posting
I had problems with timeit, I might be doing this wrong.
import timeit
setup = """
import random
random.seed(1)
global b
setup_b = [(random.random(), random.random()) for i in xrange(1000)]
c = []
def tokeep(x):
return (x[1]>.45) and (x[1]<.5)
# define and call to turn into psyco bytecode (if using psyco)
b = setup_b[:]
def listcomp():
c[:] = [x for x in b if tokeep(x)]
listcomp()
b = setup_b[:]
def filt():
c = filter(tokeep, b)
filt()
b = setup_b[:]
def forfilt():
marked = (i for i, x in enumerate(b) if tokeep(x))
shift = 0
for n in marked:
del b[n - shift]
shift += 1
forfilt()
b = setup_b[:]
def forfiltCheating():
marked = (i for i, x in enumerate(b) if (x[1] > .45) and (x[1] < .5))
shift = 0
for n in marked:
del b[n - shift]
shift += 1
forfiltCheating()
"""
listcomp = """
b = setup_b[:]
listcomp()
"""
filt = """
b = setup_b[:]
filt()
"""
forfilt = """
b = setup_b[:]
forfilt()
"""
forfiltCheating = '''
b = setup_b[:]
forfiltCheating()
'''
psycosetup = '''
import psyco
psyco.full()
'''
print "list comp = ", timeit.timeit(listcomp, setup, number = 10000)
print "filtering = ", timeit.timeit(filt, setup, number = 10000)
print 'forfilter = ', timeit.timeit(forfilt, setup, number = 10000)
print 'forfiltCheating = ', timeit.timeit(forfiltCheating, setup, number = 10000)
print '\nnow with psyco \n'
print "list comp = ", timeit.timeit(listcomp, psycosetup + setup, number = 10000)
print "filtering = ", timeit.timeit(filt, psycosetup + setup, number = 10000)
print 'forfilter = ', timeit.timeit(forfilt, psycosetup + setup, number = 10000)
print 'forfiltCheating = ', timeit.timeit(forfiltCheating, psycosetup + setup, number = 10000)
And here are the results
list comp = 6.56407690048
filtering = 5.64738512039
forfilter = 7.31555104256
forfiltCheating = 4.8994679451
now with psyco
list comp = 8.0485959053
filtering = 7.79016900063
forfilter = 9.00477004051
forfiltCheating = 4.90830993652
I must be doing something wrong with psyco, because it is actually running slower.
elements are not copied by list comprehension
this took me a while to figure out. See the example code below, to experiment yourself with different approaches
code
You can specify how long a list element takes to copy and how long it takes to evaluate. The time to copy is irrelevant for list comprehension, as it turned out.
import time
import timeit
import numpy as np
def ObjectFactory(time_eval, time_copy):
"""
Creates a class
Parameters
----------
time_eval : float
time to evaluate (True or False, i.e. keep in list or not) an object
time_copy : float
time to (shallow-) copy an object. Used by list comprehension.
Returns
-------
New class with defined copy-evaluate performance
"""
class Object:
def __init__(self, id_, keep):
self.id_ = id_
self._keep = keep
def __repr__(self):
return f"Object({self.id_}, {self.keep})"
#property
def keep(self):
time.sleep(time_eval)
return self._keep
def __copy__(self): # list comprehension does not copy the object
time.sleep(time_copy)
return self.__class__(self.id_, self._keep)
return Object
def remove_items_from_list_list_comprehension(lst):
return [el for el in lst if el.keep]
def remove_items_from_list_new_list(lst):
new_list = []
for el in lst:
if el.keep:
new_list += [el]
return new_list
def remove_items_from_list_new_list_by_ind(lst):
new_list_inds = []
for ee in range(len(lst)):
if lst[ee].keep:
new_list_inds += [ee]
return [lst[ee] for ee in new_list_inds]
def remove_items_from_list_del_elements(lst):
"""WARNING: Modifies lst"""
new_list_inds = []
for ee in range(len(lst)):
if lst[ee].keep:
new_list_inds += [ee]
for ind in new_list_inds[::-1]:
if not lst[ind].keep:
del lst[ind]
if __name__ == "__main__":
ClassSlowCopy = ObjectFactory(time_eval=0, time_copy=0.1)
ClassSlowEval = ObjectFactory(time_eval=1e-8, time_copy=0)
keep_ratio = .8
n_runs_timeit = int(1e2)
n_elements_list = int(1e2)
lsts_to_tests = dict(
list_slow_copy_remove_many = [ClassSlowCopy(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
list_slow_copy_keep_many = [ClassSlowCopy(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
list_slow_eval_remove_many = [ClassSlowEval(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
list_slow_eval_keep_many = [ClassSlowEval(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
)
for lbl, lst in lsts_to_tests.items():
print()
for fct in [
remove_items_from_list_list_comprehension,
remove_items_from_list_new_list,
remove_items_from_list_new_list_by_ind,
remove_items_from_list_del_elements,
]:
lst_loc = lst.copy()
t = timeit.timeit(lambda: fct(lst_loc), number=n_runs_timeit)
print(f"{fct.__name__}, {lbl}: {t=}")
output
remove_items_from_list_list_comprehension, list_slow_copy_remove_many: t=0.0064229519994114526
remove_items_from_list_new_list, list_slow_copy_remove_many: t=0.006507338999654166
remove_items_from_list_new_list_by_ind, list_slow_copy_remove_many: t=0.006562008995388169
remove_items_from_list_del_elements, list_slow_copy_remove_many: t=0.0076057760015828535
remove_items_from_list_list_comprehension, list_slow_copy_keep_many: t=0.006243691001145635
remove_items_from_list_new_list, list_slow_copy_keep_many: t=0.007145451003452763
remove_items_from_list_new_list_by_ind, list_slow_copy_keep_many: t=0.007032064997474663
remove_items_from_list_del_elements, list_slow_copy_keep_many: t=0.007690364996960852
remove_items_from_list_list_comprehension, list_slow_eval_remove_many: t=1.2495998149970546
remove_items_from_list_new_list, list_slow_eval_remove_many: t=1.1657221479981672
remove_items_from_list_new_list_by_ind, list_slow_eval_remove_many: t=1.2621939050004585
remove_items_from_list_del_elements, list_slow_eval_remove_many: t=1.4632593330024974
remove_items_from_list_list_comprehension, list_slow_eval_keep_many: t=1.1344162709938246
remove_items_from_list_new_list, list_slow_eval_keep_many: t=1.1323430630000075
remove_items_from_list_new_list_by_ind, list_slow_eval_keep_many: t=1.1354237199993804
remove_items_from_list_del_elements, list_slow_eval_keep_many: t=1.3084568729973398
import collections
list1=collections.deque(list1)
for i in list2:
try:
list1.remove(i)
except:
pass
INSTEAD OF CHECKING IF ELEMENT IS THERE. USING TRY EXCEPT.
I GUESS THIS FASTER

Categories

Resources