List Comparison Algorithm: How can it be made better? - python

Running on Python 3.3
I am attempting to create an efficient algorithm to pull all of the similar elements between two lists. The problem is twofold. First, I cannot seem to find any algorithms online. Second, there should be a more efficient way than what I have.
By 'similar elements', I mean two elements that are equal in value (be it string, int, whatever).
Currently, I am taking a greedy approach by:
Sorting the lists that are being compared,
Comparing each element in the shorter list against elements of the larger list,
Since largeList and smallList are sorted, saving the last index that was visited,
Continuing from the previously saved index (largeIndex).
Currently, the run-time seems to average O(n log(n)). This can be seen by running the test cases listed after this block of code.
Right now, my code looks as such:
def compare(small, large, largeStart, largeEnd):
    for i in range(largeStart, largeEnd):
        if small == large[i]:
            return [1, i]
        if small < large[i]:
            if i != 0:
                return [0, i-1]
            else:
                return [0, i]
    return [0, largeStart]
def determineLongerList(aList, bList):
    if len(aList) > len(bList):
        return (aList, bList)
    elif len(aList) < len(bList):
        return (bList, aList)
    else:
        return (aList, bList)
def compareElementsInLists(aList, bList):
    import time
    startTime = time.time()
    holder = determineLongerList(aList, bList)
    sameItems = []
    iterations = 0
    ##########################################
    smallList = sorted(holder[1])
    smallLength = len(smallList)
    smallIndex = 0
    largeList = sorted(holder[0])
    largeLength = len(largeList)
    largeIndex = 0
    while (smallIndex < smallLength):
        boolean = compare(smallList[smallIndex], largeList, largeIndex, largeLength)
        if boolean[0] == 1:
            # `compare` returns 1 as True
            sameItems.append(smallList[smallIndex])
            oldIndex = largeIndex
            largeIndex = boolean[1]
        else:
            # else no match and possible new index
            oldIndex = largeIndex
            largeIndex = boolean[1]
        smallIndex += 1
        iterations = largeIndex - oldIndex + iterations + 1
    print('RAN {it} OUT OF {mathz} POSSIBLE'.format(it=iterations, mathz=smallLength*largeLength))
    print('RATIO:\t\t'+str(iterations/(smallLength*largeLength))+'\n')
    return sameItems
Here are some test cases:
def testLargest():
    import time
    from random import randint
    print('\n\n******************************************\n')
    start_time = time.time()
    lis = []
    for i in range(0, 1000000):
        ran = randint(0, 1000000)
        lis.append(ran)
    lis2 = []
    for i in range(0, 1000000):
        ran = randint(0, 1000000)
        lis2.append(ran)
    timeTaken = time.time() - start_time
    print('CREATING LISTS TOOK:\t\t'+str(timeTaken))
    print('\n******************************************')
    start_time = time.time()
    c = compareElementsInLists(lis, lis2)
    timeTaken = time.time() - start_time
    print('COMPARING LISTS TOOK:\t\t'+str(timeTaken))
    print('NUMBER OF SAME ITEMS:\t\t'+str(len(c)))
    print('\n******************************************')
#testLargest()
'''
One rendition of testLargest:
******************************************
CREATING LISTS TOOK: 21.009342908859253
******************************************
RAN 999998 OUT OF 1000000000000 POSSIBLE
RATIO: 9.99998e-07
COMPARING LISTS TOOK: 13.99990701675415
NUMBER OF SAME ITEMS: 632328
******************************************
'''
def testLarge():
    import time
    from random import randint
    print('\n\n******************************************\n')
    start_time = time.time()
    lis = []
    for i in range(0, 1000000):
        ran = randint(0, 100)
        lis.append(ran)
    lis2 = []
    for i in range(0, 1000000):
        ran = randint(0, 100)
        lis2.append(ran)
    timeTaken = time.time() - start_time
    print('CREATING LISTS TOOK:\t\t'+str(timeTaken))
    print('\n******************************************')
    start_time = time.time()
    c = compareElementsInLists(lis, lis2)
    timeTaken = time.time() - start_time
    print('COMPARING LISTS TOOK:\t\t'+str(timeTaken))
    print('NUMBER OF SAME ITEMS:\t\t'+str(len(c)))
    print('\n******************************************')
testLarge()

If you are just searching for all elements which are in both lists, you should use data types meant to handle such tasks. In this case, sets or bags would be appropriate. These are internally represented by hashing mechanisms which are even more efficient than searching in sorted lists.
(collections.Counter represents a suitable bag.)
If you do not care for doubled elements, then sets would be fine.
a = set(listA)
print(a.intersection(listB))
This will print all elements which are in listA and in listB. (Without doubled output for doubled input elements.)
import collections
a = collections.Counter(listA)
b = collections.Counter(listB)
print(a & b)
This will print how often each element occurs in both lists.
I didn't do any measuring, but I'm pretty sure these solutions are way faster than your hand-rolled attempt.
To convert a counter into a list of all represented elements again, you can use list(c.elements()).
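For instance, a short illustrative run of the full Counter workflow (the sample lists are made up):
import collections

listA = [1, 2, 2, 3, 5]
listB = [2, 2, 3, 4]
common = collections.Counter(listA) & collections.Counter(listB)  # multiset intersection
print(list(common.elements()))  # [2, 2, 3] -- duplicates are preserved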

Using IPython's timeit magic: the posted compareElementsInLists doesn't compare favourably with just a standard set() intersection.
Setup:
import random
alist = [random.randint(0, 100000) for _ in range(1000)]
blist = [random.randint(0, 100000) for _ in range(1000)]
Compare Elements:
%%timeit -n 1000
compareElementsInLists(alist, blist)
1000 loops, best of 3: 1.9 ms per loop
Vs Set Intersection
%%timeit -n 1000
set(alist) & set(blist)
1000 loops, best of 3: 104 µs per loop
Just to make sure we get the same results:
>>> compareElementsInLists(alist, blist)
[8282, 29521, 43042, 47193, 48582, 74173, 96216, 98791]
>>> set(alist) & set(blist)
{8282, 29521, 43042, 47193, 48582, 74173, 96216, 98791}

Related

Fastest way to extract and increase latest number from end of string

I have a list of strings that have numbers as suffixes. I'm trying to extract the highest number so I can increase it by 1. Here's what I came up with but I'm wondering if there's a faster way to do this:
data = ["object_1", "object_2", "object_3", "object_blah", "object_123asdfd"]
numbers = [int(obj.split("_")[-1]) for obj in data if obj.split("_")[-1].isdigit()] or [0]
print sorted(numbers)[-1] + 1 # Output is 4
A few conditions:
It's very possible that the suffix is not a number at all, and should be skipped.
If no input is valid, then the output should be 1 (this is why I have or [0])
No Python 3 solutions, only 2.7.
Maybe some regex magic would be faster to find the highest number to increment on? I don't like the fact that I have to split twice.
Edit
I did some benchmarks on the current answers using 100 iterations on data that has 10000 items:
Alex Noname's method: 1.65s
Sushanth's method: 1.95s
Balaji Ambresh method: 2.12s
My original method: 2.16s
I've accepted an answer for now, but feel free to contribute.
Using heapq.nlargest is a pretty efficient way. Maybe someone will compare it with other methods.
import heapq
a = heapq.nlargest(1, map(int, filter(lambda b: b.isdigit(), (c.split('_')[-1] for c in data))))[0]
Comparing with the original method (Python 3.8)
import heapq
import random
from time import time
data = []
for i in range(0, 1000000):
    data.append(f'object_{random.randrange(10000000)}')
begin = time()
a = heapq.nlargest(1, map(int, filter(lambda b: b.isdigit(), (c.split('_')[-1] for c in data))))[0]
print('nlargest method: ', time() - begin)
print(a)
begin = time()
numbers = [int(obj.split("_")[-1]) for obj in data if obj.split("_")[-1].isdigit()] or [0]
a = sorted(numbers)[-1]
print('original method: ', time() - begin)
print(a)
nlargest method: 0.4306185245513916
9999995
original method: 0.8409149646759033
9999995
Try this: use a list comprehension to get all the numeric suffixes, and max returns the highest value.
max([
    int(x.split("_")[-1]) if x.split("_")[-1].isdigit() else 0 for x in data
]) + 1
Try:
import re
res = max([int((re.findall(r'_(\d+)$', item) or [0])[0]) for item in data]) + 1
Value:
4

Fastest way to determine if an ordered sublist is in a large lists of lists?

Suppose I have a my_huge_list_of_lists with 2,000,000 lists in it, each list about 50 in length.
I want to shorten the 2,000,000 my_huge_list_of_lists by discarding sublists that do not contain two elements in the sequence.
So far I have:
# https://stackoverflow.com/questions/3313590/check-for-presence-of-a-sliced-list-in-python
def check_if_list_is_sublist(lst, sublst):
    # checks if a list appears in order in another larger list.
    n = len(sublst)
    return any((sublst == lst[i:i + n]) for i in xrange(len(lst) - n + 1))

my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                         if not check_if_list_is_sublist(x, [a, b])]
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                         if not check_if_list_is_sublist(x, [b, a])]
The consecutiveness of the search term [a,b] or [b,a] is important so I can't use a set.issubset().
I find this slow. I'd like to speed it up. I've considered a few options, like using an 'early exit' and statement:
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                         if (a in x and not check_if_list_is_sublist(x, [a, b]))]
and making fewer passes over the list with an or statement:
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                         if not (check_if_list_is_sublist(x, [a, b])
                                 or check_if_list_is_sublist(x, [b, a]))]
and also working on speeding up the function (WIP):
# https://stackoverflow.com/questions/48232080/the-fastest-way-to-check-if-the-sub-list-exists-on-the-large-list
def check_if_list_is_sublist(lst, sublst):
    # checks if a list appears in order in another larger list.
    set_of_sublists = {tuple(sublst) for sublist in lst}
and done some searching on Stack Overflow; but can't think of a way because the number of times check_if_list_is_sublist() is called is len(my_huge_list) * 2.
Edit: added some sample data as requested.
from random import randint
from string import ascii_lowercase
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for x in range(50)] for y in range(2000000)]
my_neighbor_search_fwd = ['i', 'c']
my_neighbor_search_rev = my_neighbor_search_fwd[::-1]
Unpack the items in the n-sized subsequence into n variables. Then write a list comprehension to filter the list, doing a check for a, b or b, a in the sub-list, e.g.
a, b = sublst

def checklst(lst, a, b):
    l = len(lst)
    start = 0
    while True:
        try:
            a_index = lst.index(a, start)
        except ValueError:
            return False
        try:
            return a_index > -1 and lst[a_index+1] == b
        except IndexError:
            try:
                return a_index > -1 and lst[a_index-1] == b
            except IndexError:
                start = a_index + 1
                if start == l:
                    return False
                continue  # keep looking at the next a
%timeit found = [l for l in lst if checklst(l, a, b)]
1.88 s ± 31.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit found = [x for x in lst if (a in x and not check_if_list_is_sublist(x, [a,b]))]
22.1 s ± 1.67 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, I can't think of any clever algorithm checks to really reduce the amount of work here. However, you are doing a LOT of allocations in your code, and iterating too far. So, just moving some declarations out of the function a bit got me
sublst = [a, b]
l = len(sublst)
indices = range(len(sublst))

def check_if_list_is_sublist(lst):
    for i in range(len(lst) - (l - 1)):
        if lst[i] == sublst[0] and lst[i+1] == sublst[1]:
            return True
        if lst[i] == sublst[1] and lst[i + 1] == sublst[0]:
            return True
    return False

my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                         if not check_if_list_is_sublist(x)]
This reduced the run-time of your sample code above by about 50%. With a list this size, spawning some more processes and dividing the work (see the sketch below) would probably see a performance increase as well. I can't think of any way to really reduce the number of comparisons, though...
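A minimal multiprocessing sketch of that idea, assuming the module-level check_if_list_is_sublist and sublst defined above and a fork-based start method so workers can see them (the keep helper and chunksize are illustrative choices):
from multiprocessing import Pool

def keep(lst):
    # True if the sublist should survive, i.e. it does NOT contain the pair
    return not check_if_list_is_sublist(lst)

if __name__ == '__main__':
    with Pool() as pool:
        mask = pool.map(keep, my_huge_list_of_lists, chunksize=10000)
    my_huge_list_of_lists = [x for x, m in zip(my_huge_list_of_lists, mask) if m]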
This isn't what you'd call an "answer" per se; rather, it's a benchmarking framework that should help you determine the quickest way to accomplish what you want, because it allows relatively easy modification as well as the addition of different approaches.
I've put the answers currently posted into it, as well as the results of running it with them.
Caveats: Note that I haven't verified that all the tested answers in it are "correct" in the sense that they actually do what you want, nor how much memory they'll consume in the process (which might be another consideration).
Currently it appears that @Oluwafemi Sule's answer is the fastest, by an order of magnitude (about 10x) over the closest competitor.
from __future__ import print_function
from collections import namedtuple
import sys
from textwrap import dedent
import timeit
import traceback
N = 10 # Number of executions of each "algorithm".
R = 3 # Number of repetitions of those N executions.
from random import randint, randrange, seed
from string import ascii_lowercase
a, b = 'a', 'b'
NUM_SUBLISTS = 1000
SUBLIST_LEN = 50
PERCENTAGE = 50 # Percentage of sublist that should get removed.
seed(42) # Initialize random number so the results are reproducible.
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for __ in range(SUBLIST_LEN)]
                         for __ in range(NUM_SUBLISTS)]

# Put the target sequence in percentage of the sublists so they'll be removed.
for __ in range(NUM_SUBLISTS*PERCENTAGE // 100):
    list_index = randrange(NUM_SUBLISTS)
    sublist_index = randrange(SUBLIST_LEN)
    my_huge_list_of_lists[list_index][sublist_index:sublist_index+2] = [a, b]
# Common setup for all testcases (executed before any algorithm specific setup).
COMMON_SETUP = dedent("""
from __main__ import a, b, my_huge_list_of_lists, NUM_SUBLISTS, SUBLIST_LEN, PERCENTAGE
""")
class TestCase(namedtuple('CodeFragments', ['setup', 'test'])):
    """ A test case is composed of separate setup and test code fragments. """
    def __new__(cls, setup, test):
        """ Dedent code fragment in each string argument. """
        return tuple.__new__(cls, (dedent(setup), dedent(test)))
testcases = {
"OP (Nas Banov)": TestCase("""
# https://stackoverflow.com/questions/3313590/check-for-presence-of-a-sliced-list-in-python
def check_if_list_is_sublist(lst, sublst):
''' Checks if a list appears in order in another larger list. '''
n = len(sublst)
return any((sublst == lst[i:i+n]) for i in range(len(lst) - n + 1))
""", """
shortened = [x for x in my_huge_list_of_lists
if not check_if_list_is_sublist(x, [a, b])]
"""
),
"Sphinx Solution 1 (hash)": TestCase("""
# https://stackoverflow.com/a/49518843/355230
# Solution 1: By using built-in hash function.
def prepare1(huge_list, interval=1): # Use built-in hash function.
hash_db = {}
for index in range(len(huge_list) - interval + 1):
hash_sub = hash(str(huge_list[index:index+interval]))
if hash_sub in hash_db:
hash_db[hash_sub].append(huge_list[index:index+interval])
else:
hash_db[hash_sub] = [huge_list[index:index+interval]]
return hash_db
def check_sublist1(hash_db, sublst): # Use built-in hash function.
hash_sub = hash(str(sublst))
if hash_sub in hash_db:
return any([sublst == item for item in hash_db[hash_sub]])
return False
""", """
hash_db = prepare1(my_huge_list_of_lists, interval=2)
shortened = [x for x in my_huge_list_of_lists
if check_sublist1(hash_db, x)]
"""
),
"Sphinx Solution 2 (str)": TestCase("""
# https://stackoverflow.com/a/49518843/355230
#Solution 2: By using str() as hash function
def prepare2(huge_list, interval=1): # Use str() as hash function.
return {str(huge_list[index:index+interval]):huge_list[index:index+interval]
for index in range(len(huge_list) - interval + 1)}
def check_sublist2(hash_db, sublst): #use str() as hash function
hash_sub = str(sublst)
if hash_sub in hash_db:
return sublst == hash_db[hash_sub]
return False
""", """
hash_db = prepare2(my_huge_list_of_lists, interval=2)
shortened = [x for x in my_huge_list_of_lists
if check_sublist2(hash_db, x)]
"""
),
"Paul Becotte": TestCase("""
# https://stackoverflow.com/a/49504792/355230
sublst = [a, b]
l = len(sublst)
indices = range(len(sublst))
def check_if_list_is_sublist(lst):
for i in range(len(lst) - (l -1)):
if lst[i] == sublst[0] and lst[i+1] == sublst[1]:
return True
if lst[i] == sublst[1] and lst[i + 1] == sublst[0]:
return True
return False
""", """
shortened = [x for x in my_huge_list_of_lists
if not check_if_list_is_sublist(x)]
"""
),
"Oluwafemi Sule": TestCase("""
# https://stackoverflow.com/a/49504440/355230
def checklst(lst, a, b):
try:
a_index = lst.index(a)
except ValueError:
return False
try:
return a_index > -1 and lst[a_index+1] == b
except IndexError:
try:
return a_index > -1 and lst[a_index-1] == b
except IndexError:
return False
""", """
shortened = [x for x in my_huge_list_of_lists
if not checklst(x, a, b)]
"""
),
}
# Collect timing results of executing each testcase multiple times.
try:
    results = [
        (label,
         min(timeit.repeat(testcases[label].test,
                           setup=COMMON_SETUP + testcases[label].setup,
                           repeat=R, number=N)),
        ) for label in testcases
    ]
except Exception:
    traceback.print_exc(file=sys.stdout)  # direct output to stdout
    sys.exit(1)
# Display results.
print('Results for {:,d} sublists of length {:,d} with {}% percent of them matching.'
      .format(NUM_SUBLISTS, SUBLIST_LEN, PERCENTAGE))
major, minor, micro = sys.version_info[:3]
print('Fastest to slowest execution speeds using Python {}.{}.{}\n'
      '({:,d} executions, best of {:d} repetitions)'.format(major, minor, micro, N, R))
print()

longest = max(len(result[0]) for result in results)  # length of longest label
ranked = sorted(results, key=lambda t: t[1])  # ascending sort by execution time
fastest = ranked[0][1]
for result in ranked:
    print('{:>{width}} : {:9.6f} secs, rel speed {:5.2f}x, {:6.2f}% slower '
          ''.format(
              result[0], result[1], round(result[1]/fastest, 2),
              round((result[1]/fastest - 1) * 100, 2),
              width=longest))
print()
Output:
Results for 1,000 sublists of length 50 with 50% percent of them matching
Fastest to slowest execution speeds using Python 3.6.4
(10 executions, best of 3 repetitions)
Oluwafemi Sule : 0.006441 secs, rel speed 1.00x, 0.00% slower
Paul Becotte : 0.069462 secs, rel speed 10.78x, 978.49% slower
OP (Nas Banov) : 0.082758 secs, rel speed 12.85x, 1184.92% slower
Sphinx Solution 2 (str) : 0.152119 secs, rel speed 23.62x, 2261.84% slower
Sphinx Solution 1 (hash) : 0.154562 secs, rel speed 24.00x, 2299.77% slower
For finding matches in one large list, I believe hashing each element and building an index is a good solution.
The benefits you get:
Build the index once and save time for future use (no need to loop again and again for each search).
You can even build the index when the program launches and release it when the program exits.
The code below uses two methods to get the hash value: hash() and str(); sometimes you should customize the hash function for your specific scenario.
If you use str(), the code is simpler and you don't need to handle hash collisions, but it may blow up memory usage.
For hash(), I used a list to save all sub_lsts that share the same hash value, and you can use hash(sub_lst) % designed_length to control the index size (but this increases the hash collision rate).
Output for below codes:
By Hash: 0.00023986603994852955
By str(): 0.00022884208565612796
By OP's: 0.3001317172469765
[Finished in 1.781s]
Test Codes:
from random import randint
from string import ascii_lowercase
import timeit
#Generate Test Data
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for x in range(50)] for y in range(10000)]
#print(my_huge_list_of_lists)
test_lst = [['a', 'b', 'c' ], ['a', 'b', 'c'] ]
#Solution 1: By using built-in hash function
def prepare1(huge_list, interval=1):  # use built-in hash function
    hash_db = {}
    for index in range(len(huge_list) - interval + 1):
        hash_sub = hash(str(huge_list[index:index+interval]))
        if hash_sub in hash_db:
            hash_db[hash_sub].append(huge_list[index:index+interval])
        else:
            hash_db[hash_sub] = [huge_list[index:index+interval]]
    return hash_db

hash_db = prepare1(my_huge_list_of_lists, interval=2)

def check_sublist1(hash_db, sublst):  # use built-in hash function
    hash_sub = hash(str(sublst))
    if hash_sub in hash_db:
        return any([sublst == item for item in hash_db[hash_sub]])
    return False
print('By Hash:', timeit.timeit("check_sublist1(hash_db, test_lst)", setup="from __main__ import check_sublist1, my_huge_list_of_lists, test_lst, hash_db ", number=100))
#Solution 2: By using str() as hash function
def prepare2(huge_list, interval=1):  # use str() as hash function
    return {str(huge_list[index:index+interval]): huge_list[index:index+interval]
            for index in range(len(huge_list) - interval + 1)}

hash_db = prepare2(my_huge_list_of_lists, interval=2)

def check_sublist2(hash_db, sublst):  # use str() as hash function
    hash_sub = str(sublst)
    if hash_sub in hash_db:
        return sublst == hash_db[hash_sub]
    return False
print('By str():', timeit.timeit("check_sublist2(hash_db, test_lst)", setup="from __main__ import check_sublist2, my_huge_list_of_lists, test_lst, hash_db ", number=100))
#Solution 3: OP's current solution
def check_if_list_is_sublist(lst, sublst):
    # checks if a list appears in order in another larger list.
    n = len(sublst)
    return any((sublst == lst[i:i + n]) for i in range(len(lst) - n + 1))
print('By OP\'s:', timeit.timeit("check_if_list_is_sublist(my_huge_list_of_lists, test_lst)", setup="from __main__ import check_if_list_is_sublist, my_huge_list_of_lists, test_lst ", number=100))
If you'd like to remove the matched elements from one list, that is doable, but you may have to rebuild the index for the new list, unless the list is a linked list and you store a pointer to each element in the index. I just googled how to get a pointer to an element of a Python list, but couldn't find anything helpful. If someone knows how to do it, please don't hesitate to share your solution. Thanks.
Below is one sample (it generates a new list instead of mutating the original one; sometimes we still need to filter things out of the original list):
from random import randint
from string import ascii_lowercase
import timeit
#Generate Test Data
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 1)] for x in range(2)] for y in range(100)]
#print(my_huge_list_of_lists)
test_lst = [[['a', 'b'], ['a', 'b'] ], [['b', 'a'], ['a', 'b']]]
#Solution 1: By using built-in hash function
def prepare(huge_list, interval=1):  # use built-in hash function
    hash_db = {}
    for index in range(len(huge_list) - interval + 1):
        hash_sub = hash(str(huge_list[index:index+interval]))
        if hash_sub in hash_db:
            hash_db[hash_sub].append({'beg': index, 'end': index+interval, 'data': huge_list[index:index+interval]})
        else:
            hash_db[hash_sub] = [{'beg': index, 'end': index+interval, 'data': huge_list[index:index+interval]}]
    return hash_db

hash_db = prepare(my_huge_list_of_lists, interval=2)

def check_sublist(hash_db, sublst):  # use built-in hash function
    hash_sub = hash(str(sublst))
    if hash_sub in hash_db:
        return [item for item in hash_db[hash_sub] if sublst == item['data']]
    return []

def remove_if_match_sublist(target_list, hash_db, sublsts):
    matches = []
    for sublst in sublsts:
        matches += check_sublist(hash_db, sublst)
    # make sure to delete elements from end to beginning
    sorted_match = sorted(matches, key=lambda item: item['beg'], reverse=True)
    new_list = list(target_list)
    for item in sorted_match:
        del new_list[item['beg']:item['end']]
    return new_list
print('Removed By Hash:', timeit.timeit("remove_if_match_sublist(my_huge_list_of_lists, hash_db, test_lst)", setup="from __main__ import check_sublist, my_huge_list_of_lists, test_lst, hash_db, remove_if_match_sublist ", number=1))

Fastest way to find Indexes of item in list?

If one were to find the indexes of an item in a list, you could do it a couple of different ways. Here is what I know to be the fastest:
aList = [123, 'xyz', 'zara','xyz', 'abc'];
indices = [i for i, x in enumerate(aList) if x == "xyz"]
print(indices)
Another way, not Pythonic and slower:
count = 0
indices = []
aList = [123, 'xyz', 'zara','xyz', 'abc'];
for i in range(0, len(aList)):
    if 'xyz' == aList[i]:
        indices.append(i)
print(indices)
The first method is undoubtedly faster; however, what if you wanted to go faster still, is there a way? For the first index, using this method:
aList = [123, 'xyz', 'zara','xyz', 'abc'];
print "Index for xyz : ", aList.index( 'xyz' )
is very fast but can't handle multiple indexes.
How might one go about speeding things up?
Use list.index(elem, start)! That uses a for loop in C (see its implementation list_index_impl function in the source of CPython's listobject.c).
Avoid looping through all the elements in Python, it is slower than in C.
def index_finder(lst, item):
    """A generator function, if you might not need all the indices"""
    start = 0
    while True:
        try:
            start = lst.index(item, start)
            yield start
            start += 1
        except ValueError:
            break
import array

def index_find_all(lst, item, results=None):
    """ If you want all the indices.
    Pass results=[] if you explicitly need a list,
    or anything that can .append(..)
    """
    if results is None:
        length = len(lst)
        results = (array.array('B') if length <= 2**8 else
                   array.array('H') if length <= 2**16 else
                   array.array('L') if length <= 2**32 else
                   array.array('Q'))
    start = 0
    while True:
        try:
            start = lst.index(item, start)
            results.append(start)
            start += 1
        except ValueError:
            return results
# Usage example
l = [1, 2, 3, 4, 5, 6, 7, 8] * 32
print(*index_finder(l, 1))
print(*index_find_all(l, 1))
def find(target, myList):
    for i in range(len(myList)):
        if myList[i] == target:
            yield i

def find_with_list(myList, target):
    inds = []
    for i in range(len(myList)):
        if myList[i] == target:
            inds += i,
    return inds
In [8]: x = range(50)*200
In [9]: %timeit [i for i,j in enumerate(x) if j == 3]
1000 loops, best of 3: 598 us per loop
In [10]: %timeit list(find(3,x))
1000 loops, best of 3: 607 us per loop
In [11]: %timeit find(3,x)
1000000 loops, best of 3: 375 ns per loop
In [55]: %timeit find_with_list(x,3)
1000 loops, best of 3: 618 us per loop
Assuming you want a list as your output:
All options seemed to exhibit similar time performance in my test, with the list comprehension being the fastest (barely).
If using a generator is acceptable, it's way faster than the other approaches. Though it doesn't account for actually iterating over the indices, nor does it store them, so the indices cannot be iterated over a second time.
Simply create a dictionary of item->index from the list of items using zip, like so:
items_as_dict = dict(zip(list_of_items, range(0, len(list_of_items))))
index = items_as_dict[item]
To get the index of the item, you can use the dictionary.
aList = [123, 'xyz', 'zara', 'xyz', 'abc']
# The following approach works only on lists with unique values,
# so deduplicate first (preserving order)
aList = list(dict.fromkeys(aList))
index_to_value = dict(enumerate(aList))
# get the inverse mapping of the above dictionary: swap keys and values
value_to_index = dict(zip(index_to_value.values(), index_to_value.keys()))
# use 'value_to_index' to get an index by value, and 'index_to_value' to get a value by index
valueOfItemAtIndex0 = index_to_value[0]        # value = 123
indexOfItemWithValue123 = value_to_index[123]  # index = 0
D = dict()
for i, item in enumerate(l):
    if item not in D:
        D[item] = [i]
    else:
        D[item].append(i)
Then simply call D[item] to get the indices that match. You'll give up initial calculation time but gain it during call time.
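For example, a quick illustrative run of this approach (the sample list is made up):
l = ['xyz', 'abc', 'xyz', 'zara', 'xyz']
D = dict()
for i, item in enumerate(l):
    if item not in D:
        D[item] = [i]
    else:
        D[item].append(i)

print(D['xyz'])  # [0, 2, 4]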
I used another way to find the index of an element in a list in Python 3:
def index_of(elem, a):
    a_e = enumerate(a)
    a_f = list(filter(lambda x: x[1] == elem, a_e))
    if a_f:
        return a_f[0][0]
    else:
        return -1
Some tests:
a=[1,2,3,4,2]
index_of(2,a)
This function always returns the first occurrence of the element. If the element isn't in the list, it returns -1. For my purposes, that solution worked well.

Python: how to search for a substring in a set the fast way?

I have a set containing ~300,000 tuples
In [26]: sa = set(o.node for o in vrts_l2_5)
In [27]: len(sa)
Out[27]: 289798
In [31]: random.sample(sa, 1)
Out[31]: [('835644', '4696507')]
Now I want to look up elements based on a common substring, e.g. the first 4 'digits' (in fact the elements are strings). This is my approach:
def lookup_set(x_appr, y_appr):
    return [n for n in sa if n[0].startswith(x_appr) and n[1].startswith(y_appr)]
In [36]: lookup_set('6652','46529')
Out[36]: [('665274', '4652941'), ('665266', '4652956')]
Is there a more efficient, that is, faster, way to do this?
You can do it in O(log(n) + m) time, where n is the number of tuples and m is the number of matching tuples, if you can afford to keep two sorted copies of the tuples.
Sorting itself will cost O(nlog(n)), i.e. it will be asymptotically slower than your naive approach, but if you have to do a certain number of queries (more than log(n), which is almost certainly quite small) it will pay off.
The idea is that you can use bisection to find the candidates that have the correct first value and the correct second value and then intersect these sets.
However note that you want a strange kind of comparison: you care for all strings starting with the given argument. This simply means that when searching for the right-most occurrence you should fill the key with 9s.
Complete working (although not tested very much) code:
from random import randint
from operator import itemgetter
first = itemgetter(0)
second = itemgetter(1)
sa = [(str(randint(0, 1000000)), str(randint(0, 1000000))) for _ in range(300000)]
f_sorted = sorted(sa, key=first)
s_sorted = sa
s_sorted.sort(key=second)
max_length = max(len(s) for _,s in sa)
# See: bisect module from stdlib
def bisect_right(seq, element, key):
    lo = 0
    hi = len(seq)
    element = element.ljust(max_length, '9')
    while lo < hi:
        mid = (lo+hi)//2
        if element < key(seq[mid]):
            hi = mid
        else:
            lo = mid + 1
    return lo

def bisect_left(seq, element, key):
    lo = 0
    hi = len(seq)
    while lo < hi:
        mid = (lo+hi)//2
        if key(seq[mid]) < element:
            lo = mid + 1
        else:
            hi = mid
    return lo

def lookup_set(x_appr, y_appr):
    x_left = bisect_left(f_sorted, x_appr, key=first)
    x_right = bisect_right(f_sorted, x_appr, key=first)
    x_candidates = f_sorted[x_left:x_right + 1]
    y_left = bisect_left(s_sorted, y_appr, key=second)
    y_right = bisect_right(s_sorted, y_appr, key=second)
    y_candidates = s_sorted[y_left:y_right + 1]
    return set(x_candidates).intersection(y_candidates)
And the comparison with your initial solution:
In [2]: def lookup_set2(x_appr, y_appr):
...: return [n for n in sa if n[0].startswith(x_appr) and n[1].startswith(y_appr)]
In [3]: lookup_set('123', '124')
Out[3]: set([])
In [4]: lookup_set2('123', '124')
Out[4]: []
In [5]: lookup_set('123', '125')
Out[5]: set([])
In [6]: lookup_set2('123', '125')
Out[6]: []
In [7]: lookup_set('12', '125')
Out[7]: set([('12478', '125908'), ('124625', '125184'), ('125494', '125940')])
In [8]: lookup_set2('12', '125')
Out[8]: [('124625', '125184'), ('12478', '125908'), ('125494', '125940')]
In [9]: %timeit lookup_set('12', '125')
1000 loops, best of 3: 589 us per loop
In [10]: %timeit lookup_set2('12', '125')
10 loops, best of 3: 145 ms per loop
In [11]: %timeit lookup_set('123', '125')
10000 loops, best of 3: 102 us per loop
In [12]: %timeit lookup_set2('123', '125')
10 loops, best of 3: 144 ms per loop
As you can see, this solution is about 240-1400 times faster (in these examples) than your naive approach.
If you have a big set of matches:
In [19]: %timeit lookup_set('1', '2')
10 loops, best of 3: 27.1 ms per loop
In [20]: %timeit lookup_set2('1', '2')
10 loops, best of 3: 152 ms per loop
In [21]: len(lookup_set('1', '2'))
Out[21]: 3587
In [23]: %timeit lookup_set('', '2')
10 loops, best of 3: 182 ms per loop
In [24]: %timeit lookup_set2('', '2')
1 loops, best of 3: 212 ms per loop
In [25]: len(lookup_set2('', '2'))
Out[25]: 33053
As you can see this solution is faster even if the number of matches is about 10% of the total size. However, if you try to match all the data:
In [26]: %timeit lookup_set('', '')
1 loops, best of 3: 360 ms per loop
In [27]: %timeit lookup_set2('', '')
1 loops, best of 3: 221 ms per loop
It becomes (not so much) slower, although this is a quite peculiar case, and I doubt you'll frequently match almost all the elements.
Note that the time take to sort the data is quite small:
In [13]: from random import randint
...: from operator import itemgetter
...:
...: first = itemgetter(0)
...: second = itemgetter(1)
...:
...: sa2 = [(str(randint(0, 1000000)), str(randint(0, 1000000))) for _ in range(300000)]
In [14]: %%timeit
...: f_sorted = sorted(sa2, key=first)
...: s_sorted = sorted(sa2, key=second)
...: max_length = max(len(s) for _,s in sa2)
...:
1 loops, best of 3: 881 ms per loop
As you can see it takes less than one second to make the two sorted copies. Actually the above code would be slightly faster since it sorts the second copy "in place" (although Timsort could still require O(n) memory).
This means that if you have to do more than about 6-8 queries this solution will be faster.
Note: Python's standard library provides a bisect module. However, it doesn't allow a key parameter (even though I remember reading that Guido wanted it, so it may be added in the future). Hence if you want to use it directly, you'll have to use the "decorate-sort-undecorate" idiom.
Instead of:
f_sorted = sorted(sa, key=first)
You should do:
f_sorted = sorted((first, (first,second)) for first,second in sa)
I.e. you explicitly insert the key as the first element of the tuple. Afterwards you could use ('123', '') as element to pass to the bisect_* functions and it should find the correct index.
I decided to avoid this. I copy pasted the code from the sources of the module and slightly modified it to provide a simpler interface for your use-case.
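As an aside, the stdlib bisect functions did gain a key parameter in Python 3.10, so on a modern interpreter a hedged sketch of the prefix lookup using them directly (reusing f_sorted, first, and max_length from the code above) might look like:
from bisect import bisect_left, bisect_right

def lookup_first(x_appr):
    # all tuples whose first value starts with x_appr
    lo = bisect_left(f_sorted, x_appr, key=first)
    hi = bisect_right(f_sorted, x_appr.ljust(max_length, '9'), key=first)
    return f_sorted[lo:hi]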
Final remark: if you could convert the tuple elements to integers then the comparisons would be faster. However, most of the time would still be taken to perform the intersection of the sets, so I don't know exactly how much it will improve performances.
You could use a trie data structure. It is possible to build one with a tree of dict objects (see How to create a TRIE in Python), but there is a package marisa-trie that implements a memory-efficient version by binding to C++ libraries.
I have not used this library before, but playing around with it, I got this working:
from random import randint
from marisa_trie import RecordTrie
sa = [(str(randint(1000000,9999999)),str(randint(1000000,9999999))) for i in range(100000)]
# make length of string in packed format big enough!
fmt = ">10p10p"
sa_tries = (RecordTrie(fmt, zip((unicode(first) for first, _ in sa), sa)),
            RecordTrie(fmt, zip((unicode(second) for _, second in sa), sa)))

def lookup_set(sa_tries, x_appr, y_appr):
    """lookup prefix in the appropriate trie and intersect the result"""
    return (set(item[1] for item in sa_tries[0].items(unicode(x_appr))) &
            set(item[1] for item in sa_tries[1].items(unicode(y_appr))))

lookup_set(sa_tries, "2", "4")
I went through and implemented the 4 suggested solutions to compare their efficiency. I ran the tests with different prefix lengths to see how the input would affect performance. The trie and sorted list performance is definitely sensitive to the length of input with both getting faster as the input gets longer (I think it is actually sensitivity to the size of output since the output gets smaller as the prefix gets longer). However, the sorted set solution is definitely faster in all situations.
In these timing tests, there were 200000 tuples in sa and 10 runs for each method:
for prefix length 1
lookup_set_startswith : min=0.072107 avg=0.073878 max=0.077299
lookup_set_int : min=0.030447 avg=0.037739 max=0.045255
lookup_set_trie : min=0.111548 avg=0.124679 max=0.147859
lookup_set_sorted : min=0.012086 avg=0.013643 max=0.016096
for prefix length 2
lookup_set_startswith : min=0.066498 avg=0.069850 max=0.081271
lookup_set_int : min=0.027356 avg=0.034562 max=0.039137
lookup_set_trie : min=0.006949 avg=0.010091 max=0.032491
lookup_set_sorted : min=0.000915 avg=0.000944 max=0.001004
for prefix length 3
lookup_set_startswith : min=0.065708 avg=0.068467 max=0.079485
lookup_set_int : min=0.023907 avg=0.033344 max=0.043196
lookup_set_trie : min=0.000774 avg=0.000854 max=0.000929
lookup_set_sorted : min=0.000149 avg=0.000155 max=0.000163
for prefix length 4
lookup_set_startswith : min=0.065742 avg=0.068987 max=0.077351
lookup_set_int : min=0.026766 avg=0.034558 max=0.052269
lookup_set_trie : min=0.000147 avg=0.000167 max=0.000189
lookup_set_sorted : min=0.000065 avg=0.000068 max=0.000070
Here's the code:
import random

def random_digits(num_digits):
    return random.randint(10**(num_digits-1), (10**num_digits)-1)

sa = [(str(random_digits(6)), str(random_digits(7))) for _ in range(200000)]

### naive approach
def lookup_set_startswith(x_appr, y_appr):
    return [item for item in sa if item[0].startswith(x_appr) and item[1].startswith(y_appr)]
### trie approach
from marisa_trie import RecordTrie
# make length of string in packed format big enough!
fmt = ">10p10p"
sa_tries = (RecordTrie(fmt, zip([unicode(first) for first, second in sa], sa)),
            RecordTrie(fmt, zip([unicode(second) for first, second in sa], sa)))

def lookup_set_trie(x_appr, y_appr):
    # lookup prefix in the appropriate trie and intersect the result
    return set(item[1] for item in sa_tries[0].items(unicode(x_appr))) & \
           set(item[1] for item in sa_tries[1].items(unicode(y_appr)))
### int approach
sa_ints = [(int(first), int(second)) for first, second in sa]
sa_lens = tuple(map(len, sa[0]))
def lookup_set_int(x_appr, y_appr):
    x_limit = 10**(sa_lens[0]-len(x_appr))
    y_limit = 10**(sa_lens[1]-len(y_appr))
    x_int = int(x_appr) * x_limit
    y_int = int(y_appr) * y_limit
    return [sa[i] for i, int_item in enumerate(sa_ints)
            if (x_int <= int_item[0] and int_item[0] < x_int+x_limit) and
               (y_int <= int_item[1] and int_item[1] < y_int+y_limit)]
### sorted set approach
from operator import itemgetter
first = itemgetter(0)
second = itemgetter(1)
sa_sorted = (sorted(sa, key=first), sorted(sa, key=second))
max_length = max(len(s) for _,s in sa)
# See: bisect module from stdlib
def bisect_right(seq, element, key):
    lo = 0
    hi = len(seq)
    element = element.ljust(max_length, '9')
    while lo < hi:
        mid = (lo+hi)//2
        if element < key(seq[mid]):
            hi = mid
        else:
            lo = mid + 1
    return lo

def bisect_left(seq, element, key):
    lo = 0
    hi = len(seq)
    while lo < hi:
        mid = (lo+hi)//2
        if key(seq[mid]) < element:
            lo = mid + 1
        else:
            hi = mid
    return lo

def lookup_set_sorted(x_appr, y_appr):
    x_left = bisect_left(sa_sorted[0], x_appr, key=first)
    x_right = bisect_right(sa_sorted[0], x_appr, key=first)
    x_candidates = sa_sorted[0][x_left:x_right]
    y_left = bisect_left(sa_sorted[1], y_appr, key=second)
    y_right = bisect_right(sa_sorted[1], y_appr, key=second)
    y_candidates = sa_sorted[1][y_left:y_right]
    return set(x_candidates).intersection(y_candidates)
####
# test correctness
ntests = 10
candidates = [lambda x, y: set(lookup_set_startswith(x, y)),
              lambda x, y: set(lookup_set_int(x, y)),
              lookup_set_trie,
              lookup_set_sorted]

print "checking correctness (or at least consistency)..."
for dlen in range(1, 5):
    print "prefix length %d:" % dlen,
    for i in range(ntests):
        print " #%d" % i,
        prefix = map(str, (random_digits(dlen), random_digits(dlen)))
        answers = [c(*prefix) for c in candidates]
        for i, ans in enumerate(answers):
            for j, ans2 in enumerate(answers[i+1:]):
                assert ans == ans2, "answers for %s for #%d and #%d don't match" \
                    % (prefix, i, j+i+1)
    print
####
# time calls
import timeit
import numpy as np
ntests = 10
candidates = [lookup_set_startswith,
              lookup_set_int,
              lookup_set_trie,
              lookup_set_sorted]

print "timing..."
for dlen in range(1, 5):
    print "for prefix length", dlen
    times = [[] for c in candidates]
    for _ in range(ntests):
        prefix = map(str, (random_digits(dlen), random_digits(dlen)))
        for c, c_times in zip(candidates, times):
            tstart = timeit.default_timer()
            trash = c(*prefix)
            c_times.append(timeit.default_timer()-tstart)
    for c, c_times in zip(candidates, times):
        print "  %-25s: min=%f avg=%f max=%f" % (c.func_name, min(c_times), np.mean(c_times), max(c_times))
Integer manipulation is much faster than string manipulation (and smaller in memory as well).
So if you can compare integers instead, you'll be much faster.
I suspect something like this should work for you:
sa = set(int(o.node) for o in vrts_l2_5)
Then this may work for you:
def lookup_set(samples, x_appr, x_len, y_appr, y_len):
    """
    x_appr == SSS0000 where S is the digit to search for
    x_len  == number of digits to S (if SSS0000 then x_len == 4)
    """
    return ((x, y) for x, y in samples if round(x, -x_len) == x_appr and round(y, -y_len) == y_appr)
Also, it returns a generator, so you're not loading all the results into memory at once.
Updated to use round method mentioned by Bakuriu
There may be, but not by terribly much. str.startswith and and are both short-circuiting operations (they can return once they find a failure), and indexing tuples is a fast operation. Most of the time spent here will be on object lookups, such as finding the startswith method for each string. Probably the most worthwhile option is to run it through PyPy.
A faster solution would be to create a dictionary and put the first value as a key and the second as a value.
Then you would search for keys matching x_appr in the ordered key list of the dict (the ordered list lets you optimize the search over the keys with a dichotomy, for example). This provides a key list named, for example, k_list.
Then look up the values of the dict whose key is in k_list and which match y_appr.
You can also include the second step (values that match y_appr) before appending to k_list, so that k_list contains all the keys of the correct elements of the dict. A minimal sketch of this idea follows below.
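A minimal sketch of that idea, assuming the tuples in sa hold digit strings as in the question (the names by_first and sorted_keys, and padding with '9' to bound the prefix range, are illustrative choices, not part of the answer):
from bisect import bisect_left, bisect_right

by_first = {}                       # first value -> list of second values
for first, second in sa:
    by_first.setdefault(first, []).append(second)
sorted_keys = sorted(by_first)      # ordered key list, searchable by dichotomy

def lookup_set(x_appr, y_appr):
    lo = bisect_left(sorted_keys, x_appr)
    hi = bisect_right(sorted_keys, x_appr + '9' * 10)   # upper bound for the prefix
    return [(k, v)
            for k in sorted_keys[lo:hi]
            for v in by_first[k]
            if v.startswith(y_appr)]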
Here I've just compared the 'in' method and the 'find' method.
The CSV input file contains a list of URLs.
# -*- coding: utf-8 -*-
### test perfo str in set
import re
import sys
import time
import json
import csv
import timeit
cache = set()
#######################################################################
def checkinCache(c):
    global cache
    for s in cache:
        if c in s:
            return True
    return False

#######################################################################
def checkfindCache(c):
    global cache
    for s in cache:
        if s.find(c) != -1:
            return True
    return False
#######################################################################
print "1/3-loading pages..."
with open("liste_all_meta.csv.clean", "rb") as f:
    reader = csv.reader(f, delimiter=",")
    for i, line in enumerate(reader):
        cache.add(re.sub("'", "", line[2].strip()))

print "  " + str(len(cache)) + " PAGES IN CACHE"
print "2/3-test IN..."
tstart = timeit.default_timer()
for i in range(0, 1000):
    checkinCache("string to find" + str(i))
print timeit.default_timer() - tstart
print "3/3-test FIND..."
tstart = timeit.default_timer()
for i in range(0, 1000):
    checkfindCache("string to find" + str(i))
print timeit.default_timer() - tstart
print "\n\nBYE\n"
results in seconds:
1/3-loading pages...
482897 PAGES IN CACHE
2/3-test IN...
107.765980005
3/3-test FIND...
167.788629055
BYE
So the 'in' method is faster than the 'find' method :)
Have fun

Fast way to remove a few items from a list/queue

This is a follow-up to a similar question which asked the best way to write
for item in somelist:
    if determine(item):
        code_to_remove_item
and it seems the consensus was on something like
somelist[:] = [x for x in somelist if not determine(x)]
However, I think if you are only removing a few items, most of the items are being copied into the same object, and perhaps that is slow. In an answer to another related question, someone suggests:
for item in reversed(somelist):
    if determine(item):
        somelist.remove(item)
However, here list.remove will search for the item, which is O(N) in the length of the list. Maybe we are limited in that the list is represented as an array, rather than a linked list, so removing items will need to move everything after it. However, it is suggested here that collections.deque is represented as a doubly linked list. It should then be possible to remove in O(1) while iterating. How would we actually accomplish this?
Update:
I did some time testing as well, with the following code:
import timeit
setup = """
import random
random.seed(1)
b = [(random.random(),random.random()) for i in xrange(1000)]
c = []
def tokeep(x):
    return (x[1]>.45) and (x[1]<.5)
"""
listcomp = """
c[:] = [x for x in b if tokeep(x)]
"""
filt = """
c = filter(tokeep, b)
"""
print "list comp = ", timeit.timeit(listcomp,setup, number = 10000)
print "filtering = ", timeit.timeit(filt,setup, number = 10000)
and got:
list comp = 4.01255393028
filtering = 3.59962391853
The list comprehension is the asymptotically optimal solution:
somelist = [x for x in somelist if not determine(x)]
It only makes one pass over the list, so runs in O(n) time. Since you need to call determine() on each object, any algorithm will require at least O(n) operations. The list comprehension does have to do some copying, but it's only copying references to the objects not copying the objects themselves.
Removing items from a list in Python is O(n), so anything with a remove, pop, or del inside the loop will be O(n**2).
Also, in CPython list comprehensions are faster than for loops.
If you need to remove items in O(1), you can use a hash map (a dict or set in Python); a minimal sketch follows below.
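A minimal sketch of that idea, assuming the elements are hashable and their order doesn't matter (the sample data is made up):
items = {"a", "b", "c", "d"}   # a set gives O(1) average membership tests and removal
items.discard("c")             # removes in O(1) on average; no error if the item is missing
print(items)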
Since list.remove is equivalent to del list[list.index(x)], you could do:
for idx, item in enumerate(somelist):
    if determine(item):
        del somelist[idx]
But: you should not modify the list while iterating over it. It will bite you, sooner or later. Use filter or list comprehension first, and optimise later.
A deque is optimized for head and tail removal, not for arbitrary removal in the middle. The removal itself is fast, but you still have to traverse the list to the removal point. If you're iterating through the entire length, then the only difference between filtering a deque and filtering a list (using filter or a comprehension) is the overhead of copying, which at worst is a constant multiple; it's still an O(n) operation. Also, note that the objects in the list aren't being copied -- just the references to them. So it's not that much overhead.
It's possible that you could avoid copying like so, but I have no particular reason to believe this is faster than a straightforward list comprehension -- it's probably not:
write_i = 0
for read_i in range(len(L)):
    L[write_i] = L[read_i]
    if L[read_i] not in ['a', 'c']:
        write_i += 1
del L[write_i:]
I took a stab at this. My solution is slower, but requires less memory overhead (i.e. doesn't create a new array). It might even be faster in some circumstances!
This code has been edited since its first posting
I had problems with timeit, I might be doing this wrong.
import timeit
setup = """
import random
random.seed(1)
global b
setup_b = [(random.random(), random.random()) for i in xrange(1000)]
c = []
def tokeep(x):
    return (x[1]>.45) and (x[1]<.5)
# define and call to turn into psyco bytecode (if using psyco)
b = setup_b[:]
def listcomp():
    c[:] = [x for x in b if tokeep(x)]
listcomp()
b = setup_b[:]
def filt():
    c = filter(tokeep, b)
filt()
b = setup_b[:]
def forfilt():
    marked = (i for i, x in enumerate(b) if tokeep(x))
    shift = 0
    for n in marked:
        del b[n - shift]
        shift += 1
forfilt()
b = setup_b[:]
def forfiltCheating():
    marked = (i for i, x in enumerate(b) if (x[1] > .45) and (x[1] < .5))
    shift = 0
    for n in marked:
        del b[n - shift]
        shift += 1
forfiltCheating()
"""
listcomp = """
b = setup_b[:]
listcomp()
"""
filt = """
b = setup_b[:]
filt()
"""
forfilt = """
b = setup_b[:]
forfilt()
"""
forfiltCheating = '''
b = setup_b[:]
forfiltCheating()
'''
psycosetup = '''
import psyco
psyco.full()
'''
print "list comp = ", timeit.timeit(listcomp, setup, number = 10000)
print "filtering = ", timeit.timeit(filt, setup, number = 10000)
print 'forfilter = ', timeit.timeit(forfilt, setup, number = 10000)
print 'forfiltCheating = ', timeit.timeit(forfiltCheating, setup, number = 10000)
print '\nnow with psyco \n'
print "list comp = ", timeit.timeit(listcomp, psycosetup + setup, number = 10000)
print "filtering = ", timeit.timeit(filt, psycosetup + setup, number = 10000)
print 'forfilter = ', timeit.timeit(forfilt, psycosetup + setup, number = 10000)
print 'forfiltCheating = ', timeit.timeit(forfiltCheating, psycosetup + setup, number = 10000)
And here are the results
list comp = 6.56407690048
filtering = 5.64738512039
forfilter = 7.31555104256
forfiltCheating = 4.8994679451
now with psyco
list comp = 8.0485959053
filtering = 7.79016900063
forfilter = 9.00477004051
forfiltCheating = 4.90830993652
I must be doing something wrong with psyco, because it is actually running slower.
Elements are not copied by a list comprehension.
This took me a while to figure out. See the example code below to experiment yourself with different approaches.
code
You can specify how long a list element takes to copy and how long it takes to evaluate. The time to copy is irrelevant for list comprehension, as it turned out.
import time
import timeit
import numpy as np
def ObjectFactory(time_eval, time_copy):
    """
    Creates a class

    Parameters
    ----------
    time_eval : float
        time to evaluate (True or False, i.e. keep in list or not) an object
    time_copy : float
        time to (shallow-) copy an object. Used by list comprehension.

    Returns
    -------
    New class with defined copy-evaluate performance
    """
    class Object:
        def __init__(self, id_, keep):
            self.id_ = id_
            self._keep = keep

        def __repr__(self):
            return f"Object({self.id_}, {self.keep})"

        @property
        def keep(self):
            time.sleep(time_eval)
            return self._keep

        def __copy__(self):  # list comprehension does not copy the object
            time.sleep(time_copy)
            return self.__class__(self.id_, self._keep)

    return Object
def remove_items_from_list_list_comprehension(lst):
    return [el for el in lst if el.keep]

def remove_items_from_list_new_list(lst):
    new_list = []
    for el in lst:
        if el.keep:
            new_list += [el]
    return new_list

def remove_items_from_list_new_list_by_ind(lst):
    new_list_inds = []
    for ee in range(len(lst)):
        if lst[ee].keep:
            new_list_inds += [ee]
    return [lst[ee] for ee in new_list_inds]

def remove_items_from_list_del_elements(lst):
    """WARNING: Modifies lst"""
    new_list_inds = []
    for ee in range(len(lst)):
        if lst[ee].keep:
            new_list_inds += [ee]
    for ind in new_list_inds[::-1]:
        if not lst[ind].keep:
            del lst[ind]
if __name__ == "__main__":
    ClassSlowCopy = ObjectFactory(time_eval=0, time_copy=0.1)
    ClassSlowEval = ObjectFactory(time_eval=1e-8, time_copy=0)
    keep_ratio = .8
    n_runs_timeit = int(1e2)
    n_elements_list = int(1e2)

    lsts_to_tests = dict(
        list_slow_copy_remove_many = [ClassSlowCopy(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
        list_slow_copy_keep_many = [ClassSlowCopy(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
        list_slow_eval_remove_many = [ClassSlowEval(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
        list_slow_eval_keep_many = [ClassSlowEval(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
    )

    for lbl, lst in lsts_to_tests.items():
        print()
        for fct in [
            remove_items_from_list_list_comprehension,
            remove_items_from_list_new_list,
            remove_items_from_list_new_list_by_ind,
            remove_items_from_list_del_elements,
        ]:
            lst_loc = lst.copy()
            t = timeit.timeit(lambda: fct(lst_loc), number=n_runs_timeit)
            print(f"{fct.__name__}, {lbl}: {t=}")
output
remove_items_from_list_list_comprehension, list_slow_copy_remove_many: t=0.0064229519994114526
remove_items_from_list_new_list, list_slow_copy_remove_many: t=0.006507338999654166
remove_items_from_list_new_list_by_ind, list_slow_copy_remove_many: t=0.006562008995388169
remove_items_from_list_del_elements, list_slow_copy_remove_many: t=0.0076057760015828535
remove_items_from_list_list_comprehension, list_slow_copy_keep_many: t=0.006243691001145635
remove_items_from_list_new_list, list_slow_copy_keep_many: t=0.007145451003452763
remove_items_from_list_new_list_by_ind, list_slow_copy_keep_many: t=0.007032064997474663
remove_items_from_list_del_elements, list_slow_copy_keep_many: t=0.007690364996960852
remove_items_from_list_list_comprehension, list_slow_eval_remove_many: t=1.2495998149970546
remove_items_from_list_new_list, list_slow_eval_remove_many: t=1.1657221479981672
remove_items_from_list_new_list_by_ind, list_slow_eval_remove_many: t=1.2621939050004585
remove_items_from_list_del_elements, list_slow_eval_remove_many: t=1.4632593330024974
remove_items_from_list_list_comprehension, list_slow_eval_keep_many: t=1.1344162709938246
remove_items_from_list_new_list, list_slow_eval_keep_many: t=1.1323430630000075
remove_items_from_list_new_list_by_ind, list_slow_eval_keep_many: t=1.1354237199993804
remove_items_from_list_del_elements, list_slow_eval_keep_many: t=1.3084568729973398
import collections

list1 = collections.deque(list1)
for i in list2:
    try:
        list1.remove(i)
    except ValueError:
        pass

Instead of checking whether the element is there, this uses try/except. I guess this is faster.
