binarySearch vs in, Unexpected Results (Python)

I am trying to compare the complexity of in and binarySearch in Python 2, expecting O(1) for in and O(log n) for binarySearch. However, the results are unexpected. Are the programs timed incorrectly, or is there another mistake?
Here is the code:
import time

x = [x for x in range(1000000)]

def Time_in(alist, item):
    t1 = time.time()
    found = item in alist
    t2 = time.time()
    timer = t2 - t1
    return found, timer

def Time_binarySearch(alist, item):
    first = 0
    last = len(alist) - 1
    found = False
    t1 = time.time()
    while first <= last and not found:
        midpoint = (first + last) // 2
        if alist[midpoint] == item:
            found = True
        else:
            if item < alist[midpoint]:
                last = midpoint - 1
            else:
                first = midpoint + 1
    t2 = time.time()
    timer = t2 - t1
    return found, timer

print "binarySearch: ", Time_binarySearch(x, 600000)
print "in: ", Time_in(x, 600000)
The results are:

The binary search is going so fast that when you try to print the time it took, it just prints 0.0, whereas using in takes long enough that you see the very small fraction of a second it took.
The reason that in takes longer is that this is a list, not a set or similar data structure. With a set, membership testing is O(1) on average (O(n) in the worst case); with a list, every element has to be checked in order until there's a match, or the list is exhausted.
Here's some benchmarking code:
from __future__ import print_function
import bisect
import timeit

def binarysearch(alist, item):
    first = 0
    last = len(alist) - 1
    found = False
    while first <= last and not found:
        midpoint = (first + last) // 2
        if alist[midpoint] == item:
            found = True
        else:
            if item < alist[midpoint]:
                last = midpoint - 1
            else:
                first = midpoint + 1
    return found

def bisect_index(alist, item):
    idx = bisect.bisect_left(alist, item)
    if idx != len(alist) and alist[idx] == item:
        found = True
    else:
        found = False
    return found

time_tests = [
    (' 600 in list(range(1000))',
     '600 in alist',
     'alist = list(range(1000))'),
    (' 600 in list(range(10000000))',
     '600 in alist',
     'alist = list(range(10000000))'),
    (' 600 in set(range(1000))',
     '600 in aset',
     'aset = set(range(1000))'),
    ('6000000 in set(range(10000000))',
     '6000000 in aset',
     'aset = set(range(10000000))'),
    ('binarysearch(list(range(1000)), 600)',
     'binarysearch(alist, 600)',
     'from __main__ import binarysearch; alist = list(range(1000))'),
    ('binarysearch(list(range(10000000)), 6000000)',
     'binarysearch(alist, 6000000)',
     'from __main__ import binarysearch; alist = list(range(10000000))'),
    ('bisect_index(list(range(1000)), 600)',
     'bisect_index(alist, 600)',
     'from __main__ import bisect_index; alist = list(range(1000))'),
    ('bisect_index(list(range(10000000)), 6000000)',
     'bisect_index(alist, 6000000)',
     'from __main__ import bisect_index; alist = list(range(10000000))'),
]

for display, statement, setup in time_tests:
    result = timeit.timeit(statement, setup, number=1000000)
    print('{0:<45}{1}'.format(display, result))
And the results:
# Python 2.7
600 in list(range(1000)) 5.29039907455
600 in list(range(10000000)) 5.22499394417
600 in set(range(1000)) 0.0402979850769
6000000 in set(range(10000000)) 0.0390179157257
binarysearch(list(range(1000)), 600) 0.961972951889
binarysearch(list(range(10000000)), 6000000) 3.014950037
bisect_index(list(range(1000)), 600) 0.421462059021
bisect_index(list(range(10000000)), 6000000) 0.634694814682
# Python 3.4
600 in list(range(1000)) 8.578510413994081
600 in list(range(10000000)) 8.578105041990057
600 in set(range(1000)) 0.04088461003266275
6000000 in set(range(10000000)) 0.043901249999180436
binarysearch(list(range(1000)), 600) 1.6799193460028619
binarysearch(list(range(10000000)), 6000000) 6.099467994994484
bisect_index(list(range(1000)), 600) 0.5168328559957445
bisect_index(list(range(10000000)), 6000000) 0.7694612839259207
# PyPy 2.6.0 (Python 2.7.9)
600 in list(range(1000)) 0.122292041779
600 in list(range(10000000)) 0.00196599960327
600 in set(range(1000)) 0.101480007172
6000000 in set(range(10000000)) 0.00759720802307
binarysearch(list(range(1000)), 600) 0.242530822754
binarysearch(list(range(10000000)), 6000000) 0.189949035645
bisect_index(list(range(1000)), 600) 0.132127046585
bisect_index(list(range(10000000)), 6000000) 0.197204828262

Why do you expect O(1) when testing if an element is contained in a list?
If you don't know anything about the list (like that it is sorted as in your example) then you have to go through each element and compare it.
So you get O(N).
Python lists cannot assume anything about what you store in them, so they have to use a naive implementation for list.__contains__.
If you want a faster test, then you can try to use a dictionary or set.
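A minimal sketch of that suggestion (mine, not part of the original answer), using timeit: converting the list to a set once makes each membership test a hash lookup instead of a linear scan.

from __future__ import print_function
import timeit

setup = "x = list(range(1000000)); xs = set(x)"
print(timeit.timeit("600000 in x", setup, number=100))   # scans the list on every call
print(timeit.timeit("600000 in xs", setup, number=100))  # hash lookup on every call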

The time complexities of all list operations in Python are listed on the TimeComplexity page of the Python wiki (https://wiki.python.org/moin/TimeComplexity).
As can be seen there, x in s for a list is O(n), which is significantly slower than binarySearch's O(log n).
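To make the comparison concrete, here is a minimal sketch (my own, not from the answer above) of an O(log n) membership test on a sorted list using the standard bisect module:

import bisect

def contains_sorted(alist, item):
    # O(log n): find the insertion point, then check whether the item is there
    idx = bisect.bisect_left(alist, item)
    return idx != len(alist) and alist[idx] == item

print(contains_sorted(list(range(1000000)), 600000))   # True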

Related

Fastest way to determine if an ordered sublist is in a large lists of lists?

Suppose I have my_huge_list_of_lists with 2,000,000 lists in it, each about 50 elements long.
I want to shorten my_huge_list_of_lists by discarding sublists that do not contain the two elements in sequence.
So far I have:
# https://stackoverflow.com/questions/3313590/check-for-presence-of-a-sliced-list-in-python
def check_if_list_is_sublist(lst, sublst):
    # checks if a list appears in order in another larger list.
    n = len(sublst)
    return any((sublst == lst[i:i + n]) for i in xrange(len(lst) - n + 1))

my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                         if not check_if_list_is_sublist(x, [a, b])]
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                         if not check_if_list_is_sublist(x, [b, a])]
The consecutiveness of the search term [a,b] or [b,a] is important, so I can't use set.issubset().
I find this slow and I'd like to speed it up. I've considered a few options, like using an 'early exit' and condition:
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                         if (a in x and not check_if_list_is_sublist(x, [a, b]))]
and making fewer passes over the list with an or statement:
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                         if not (check_if_list_is_sublist(x, [a, b])
                                 or check_if_list_is_sublist(x, [b, a]))]
and also working on speeding up the function (WIP)
# https://stackoverflow.com/questions/48232080/the-fastest-way-to-check-if-the-sub-list-exists-on-the-large-list
def check_if_list_is_sublist(lst, sublst):
    # checks if a list appears in order in another larger list.
    set_of_sublists = {tuple(sublst) for sublist in lst}
I've also done some searching on Stack Overflow, but can't think of a way around the fact that check_if_list_is_sublist() gets called len(my_huge_list) * 2 times.
Edit: adding some sample data as requested
from random import randint
from string import ascii_lowercase

my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for x in range(50)] for y in range(2000000)]
my_neighbor_search_fwd = [i, c]
my_neighbor_search_rev = my_neighbor_search_fwd[::-1]  # note: list.reverse() returns None, so build a reversed copy instead
Unpack the items in the n-sized subsequence into n variables. Then write a list comprehension to filter the list, doing a check for a, b or b, a in the sub-list, e.g.
a, b = sublst

def checklst(lst, a, b):
    l = len(lst)
    start = 0
    while True:
        try:
            a_index = lst.index(a, start)
        except ValueError:
            return False
        try:
            return a_index > -1 and lst[a_index+1] == b
        except IndexError:
            try:
                return a_index > -1 and lst[a_index-1] == b
            except IndexError:
                start = a_index + 1
                if start == l:
                    return False
                continue  # keep looking at the next a
%timeit found = [l for l in lst if checklst(l, a, b)]
1.88 s ± 31.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit found = [x for x in lst if (a in x and not check_if_list_is_sublist(x, [a,b]))]
22.1 s ± 1.67 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, I can't think of any clever algorithm checks to really reduce the amount of work here. However, you are doing a LOT of allocations in your code, and iterating too far. So, just moving some declarations out of the function a bit got me
sublst = [a, b]
l = len(sublst)
indices = range(len(sublst))

def check_if_list_is_sublist(lst):
    for i in range(len(lst) - (l - 1)):
        if lst[i] == sublst[0] and lst[i+1] == sublst[1]:
            return True
        if lst[i] == sublst[1] and lst[i+1] == sublst[0]:
            return True
    return False

my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                         if not check_if_list_is_sublist(x)]
Which reduced the run-time of your sample code above by about 50%. With a list this size, spawning some more processes and dividing the work would probably see a performance increase as well. Can't think of any way to really reduce the amount of comparisons though...
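A minimal sketch of that multiprocessing suggestion (my own, Python 3; contains_pair and the data sizes are illustrative, not from the answer):

import multiprocessing as mp
from random import choice
from string import ascii_lowercase

a, b = 'a', 'b'

def contains_pair(lst):
    # True if lst contains a,b or b,a adjacently (same check as above)
    return any((lst[i] == a and lst[i+1] == b) or (lst[i] == b and lst[i+1] == a)
               for i in range(len(lst) - 1))

def keep(lst):
    return not contains_pair(lst)

if __name__ == '__main__':
    data = [[choice(ascii_lowercase) for _ in range(50)] for _ in range(100000)]
    with mp.Pool() as pool:                              # one worker per CPU by default
        mask = pool.map(keep, data, chunksize=10000)
    data = [x for x, k in zip(data, mask) if k]
    print(len(data))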
This isn't what you'd call an "answer" per se; rather, it's a benchmarking framework that should help you determine the quickest way to accomplish what you want, because it allows relatively easy modification as well as the addition of different approaches.
I've put the answers currently posted into it, as well as the results of running it with them.
Caveats: Note that I haven't verified that all the tested answers in it are "correct" in the sense that they actually do what you want, nor how much memory they'll consume in the process, which might be another consideration.
Currently it appears that @Oluwafemi Sule's answer is the fastest, by an order of magnitude (roughly 10x) over the closest competitor.
from __future__ import print_function
from collections import namedtuple
import sys
from textwrap import dedent
import timeit
import traceback

N = 10  # Number of executions of each "algorithm".
R = 3   # Number of repetitions of those N executions.

from random import randint, randrange, seed
from string import ascii_lowercase

a, b = 'a', 'b'
NUM_SUBLISTS = 1000
SUBLIST_LEN = 50
PERCENTAGE = 50  # Percentage of sublists that should get removed.

seed(42)  # Initialize random number so the results are reproducible.
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for __ in range(SUBLIST_LEN)]
                            for __ in range(NUM_SUBLISTS)]

# Put the target sequence in a percentage of the sublists so they'll be removed.
for __ in range(NUM_SUBLISTS*PERCENTAGE // 100):
    list_index = randrange(NUM_SUBLISTS)
    sublist_index = randrange(SUBLIST_LEN)
    my_huge_list_of_lists[list_index][sublist_index:sublist_index+2] = [a, b]

# Common setup for all testcases (executed before any algorithm specific setup).
COMMON_SETUP = dedent("""
    from __main__ import a, b, my_huge_list_of_lists, NUM_SUBLISTS, SUBLIST_LEN, PERCENTAGE
""")

class TestCase(namedtuple('CodeFragments', ['setup', 'test'])):
    """ A test case is composed of separate setup and test code fragments. """
    def __new__(cls, setup, test):
        """ Dedent code fragment in each string argument. """
        return tuple.__new__(cls, (dedent(setup), dedent(test)))

testcases = {
    "OP (Nas Banov)": TestCase("""
        # https://stackoverflow.com/questions/3313590/check-for-presence-of-a-sliced-list-in-python
        def check_if_list_is_sublist(lst, sublst):
            ''' Checks if a list appears in order in another larger list. '''
            n = len(sublst)
            return any((sublst == lst[i:i+n]) for i in range(len(lst) - n + 1))
        """, """
        shortened = [x for x in my_huge_list_of_lists
                        if not check_if_list_is_sublist(x, [a, b])]
        """
    ),
    "Sphinx Solution 1 (hash)": TestCase("""
        # https://stackoverflow.com/a/49518843/355230

        # Solution 1: By using built-in hash function.
        def prepare1(huge_list, interval=1):  # Use built-in hash function.
            hash_db = {}
            for index in range(len(huge_list) - interval + 1):
                hash_sub = hash(str(huge_list[index:index+interval]))
                if hash_sub in hash_db:
                    hash_db[hash_sub].append(huge_list[index:index+interval])
                else:
                    hash_db[hash_sub] = [huge_list[index:index+interval]]
            return hash_db

        def check_sublist1(hash_db, sublst):  # Use built-in hash function.
            hash_sub = hash(str(sublst))
            if hash_sub in hash_db:
                return any([sublst == item for item in hash_db[hash_sub]])
            return False
        """, """
        hash_db = prepare1(my_huge_list_of_lists, interval=2)
        shortened = [x for x in my_huge_list_of_lists
                        if check_sublist1(hash_db, x)]
        """
    ),
    "Sphinx Solution 2 (str)": TestCase("""
        # https://stackoverflow.com/a/49518843/355230

        # Solution 2: By using str() as hash function.
        def prepare2(huge_list, interval=1):  # Use str() as hash function.
            return {str(huge_list[index:index+interval]): huge_list[index:index+interval]
                        for index in range(len(huge_list) - interval + 1)}

        def check_sublist2(hash_db, sublst):  # Use str() as hash function.
            hash_sub = str(sublst)
            if hash_sub in hash_db:
                return sublst == hash_db[hash_sub]
            return False
        """, """
        hash_db = prepare2(my_huge_list_of_lists, interval=2)
        shortened = [x for x in my_huge_list_of_lists
                        if check_sublist2(hash_db, x)]
        """
    ),
    "Paul Becotte": TestCase("""
        # https://stackoverflow.com/a/49504792/355230
        sublst = [a, b]
        l = len(sublst)
        indices = range(len(sublst))

        def check_if_list_is_sublist(lst):
            for i in range(len(lst) - (l - 1)):
                if lst[i] == sublst[0] and lst[i+1] == sublst[1]:
                    return True
                if lst[i] == sublst[1] and lst[i+1] == sublst[0]:
                    return True
            return False
        """, """
        shortened = [x for x in my_huge_list_of_lists
                        if not check_if_list_is_sublist(x)]
        """
    ),
    "Oluwafemi Sule": TestCase("""
        # https://stackoverflow.com/a/49504440/355230
        def checklst(lst, a, b):
            try:
                a_index = lst.index(a)
            except ValueError:
                return False
            try:
                return a_index > -1 and lst[a_index+1] == b
            except IndexError:
                try:
                    return a_index > -1 and lst[a_index-1] == b
                except IndexError:
                    return False
        """, """
        shortened = [x for x in my_huge_list_of_lists
                        if not checklst(x, a, b)]
        """
    ),
}

# Collect timing results of executing each testcase multiple times.
try:
    results = [
        (label,
         min(timeit.repeat(testcases[label].test,
                           setup=COMMON_SETUP + testcases[label].setup,
                           repeat=R, number=N)),
        ) for label in testcases
    ]
except Exception:
    traceback.print_exc(file=sys.stdout)  # direct output to stdout
    sys.exit(1)

# Display results.
print('Results for {:,d} sublists of length {:,d} with {}% percent of them matching.'
        .format(NUM_SUBLISTS, SUBLIST_LEN, PERCENTAGE))
major, minor, micro = sys.version_info[:3]
print('Fastest to slowest execution speeds using Python {}.{}.{}\n'
      '({:,d} executions, best of {:d} repetitions)'.format(major, minor, micro, N, R))
print()

longest = max(len(result[0]) for result in results)  # length of longest label
ranked = sorted(results, key=lambda t: t[1])  # ascending sort by execution time
fastest = ranked[0][1]
for result in ranked:
    print('{:>{width}} : {:9.6f} secs, rel speed {:5.2f}x, {:6.2f}% slower '
          ''.format(
                result[0], result[1], round(result[1]/fastest, 2),
                round((result[1]/fastest - 1) * 100, 2),
                width=longest))
print()
Output:
Results for 1,000 sublists of length 50 with 50% percent of them matching
Fastest to slowest execution speeds using Python 3.6.4
(10 executions, best of 3 repetitions)
Oluwafemi Sule : 0.006441 secs, rel speed 1.00x, 0.00% slower
Paul Becotte : 0.069462 secs, rel speed 10.78x, 978.49% slower
OP (Nas Banov) : 0.082758 secs, rel speed 12.85x, 1184.92% slower
Sphinx Solution 2 (str) : 0.152119 secs, rel speed 23.62x, 2261.84% slower
Sphinx Solution 1 (hash) : 0.154562 secs, rel speed 24.00x, 2299.77% slower
For matching a search sequence in one large list, I believe hashing the elements and building an index will be a good solution.
The benefit you get: you build the index once and save time for every future search (no need to loop over the whole list again for each lookup).
You can even build the index when the program launches and release it when the program exits.
The code below uses two methods to get a hash value, hash() and str(); sometimes you should customize a hash function for your specific scenario.
If you use str(), the code looks simpler and you don't need to consider hash collisions, but it may blow up memory.
For hash(), I use a list to save every sub_lst that has the same hash value, and you can use hash(sub_lst) % designed_length to control the index size (but that increases the hash collision rate).
Output for the code below:
By Hash: 0.00023986603994852955
By str(): 0.00022884208565612796
By OP's: 0.3001317172469765
[Finished in 1.781s]
Test Codes:
from random import randint
from string import ascii_lowercase
import timeit

# Generate test data.
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for x in range(50)] for y in range(10000)]
#print(my_huge_list_of_lists)
test_lst = [['a', 'b', 'c'], ['a', 'b', 'c']]

# Solution 1: By using built-in hash function.
def prepare1(huge_list, interval=1):  # use built-in hash function
    hash_db = {}
    for index in range(len(huge_list) - interval + 1):
        hash_sub = hash(str(huge_list[index:index+interval]))
        if hash_sub in hash_db:
            hash_db[hash_sub].append(huge_list[index:index+interval])
        else:
            hash_db[hash_sub] = [huge_list[index:index+interval]]
    return hash_db

hash_db = prepare1(my_huge_list_of_lists, interval=2)

def check_sublist1(hash_db, sublst):  # use built-in hash function
    hash_sub = hash(str(sublst))
    if hash_sub in hash_db:
        return any([sublst == item for item in hash_db[hash_sub]])
    return False

print('By Hash:', timeit.timeit("check_sublist1(hash_db, test_lst)", setup="from __main__ import check_sublist1, my_huge_list_of_lists, test_lst, hash_db ", number=100))

# Solution 2: By using str() as hash function.
def prepare2(huge_list, interval=1):  # use str() as hash function
    return {str(huge_list[index:index+interval]): huge_list[index:index+interval] for index in range(len(huge_list) - interval + 1)}

hash_db = prepare2(my_huge_list_of_lists, interval=2)

def check_sublist2(hash_db, sublst):  # use str() as hash function
    hash_sub = str(sublst)
    if hash_sub in hash_db:
        return sublst == hash_db[hash_sub]
    return False

print('By str():', timeit.timeit("check_sublist2(hash_db, test_lst)", setup="from __main__ import check_sublist2, my_huge_list_of_lists, test_lst, hash_db ", number=100))

# Solution 3: OP's current solution.
def check_if_list_is_sublist(lst, sublst):
    # checks if a list appears in order in another larger list.
    n = len(sublst)
    return any((sublst == lst[i:i + n]) for i in range(len(lst) - n + 1))

print('By OP\'s:', timeit.timeit("check_if_list_is_sublist(my_huge_list_of_lists, test_lst)", setup="from __main__ import check_if_list_is_sublist, my_huge_list_of_lists, test_lst ", number=100))
If you'd like to remove the matched elements from the list, that is doable, but the effect is that you may have to rebuild the index for the new list, unless the list is a linked list and the index saves a pointer to each element. I just googled how to get a pointer to an element of a Python list, but couldn't find anything helpful. If someone knows how to do it, please don't hesitate to share your solution. Thanks.
Below is one sample (it generates a new list instead of mutating the original one; sometimes we still need to filter something out of the original list):
from random import randint
from string import ascii_lowercase
import timeit

# Generate test data.
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 1)] for x in range(2)] for y in range(100)]
#print(my_huge_list_of_lists)
test_lst = [[['a', 'b'], ['a', 'b']], [['b', 'a'], ['a', 'b']]]

# Solution 1: By using built-in hash function.
def prepare(huge_list, interval=1):  # use built-in hash function
    hash_db = {}
    for index in range(len(huge_list) - interval + 1):
        hash_sub = hash(str(huge_list[index:index+interval]))
        if hash_sub in hash_db:
            hash_db[hash_sub].append({'beg': index, 'end': index+interval, 'data': huge_list[index:index+interval]})
        else:
            hash_db[hash_sub] = [{'beg': index, 'end': index+interval, 'data': huge_list[index:index+interval]}]
    return hash_db

hash_db = prepare(my_huge_list_of_lists, interval=2)

def check_sublist(hash_db, sublst):  # use built-in hash function
    hash_sub = hash(str(sublst))
    if hash_sub in hash_db:
        return [item for item in hash_db[hash_sub] if sublst == item['data']]
    return []

def remove_if_match_sublist(target_list, hash_db, sublsts):
    matches = []
    for sublst in sublsts:
        matches += check_sublist(hash_db, sublst)
    # make sure to delete elements from the end to the beginning
    sorted_match = sorted(matches, key=lambda item: item['beg'], reverse=True)
    new_list = list(target_list)
    for item in sorted_match:
        del new_list[item['beg']:item['end']]
    return new_list

print('Removed By Hash:', timeit.timeit("remove_if_match_sublist(my_huge_list_of_lists, hash_db, test_lst)", setup="from __main__ import check_sublist, my_huge_list_of_lists, test_lst, hash_db, remove_if_match_sublist ", number=1))

Python - measure function time

I am having a problem with measuring the time of a function.
My function is a "linear search":
def linear_search(obj, item):
    for i in range(0, len(obj)):
        if obj[i] == item:
            return i
    return -1
And I made another function that measures the time 100 times and adds all the results to a list:
def measureTime(a):
    nl = []
    import random
    import time
    for x in range(0, 100):  # calculating time
        start = time.time()
        a
        end = time.time()
        times = end - start
        nl.append(times)
    return nl
When I'm using measureTime(linear_search(list,random.choice(range(0,50)))), the function always returns [0.0].
What can cause this problem? Thanks.
You are actually passing the result of linear_search into measureTime; you need to pass in the function and its arguments instead, so they can be executed inside measureTime, as in @martijnn2008's answer.
Or better still, you can use the timeit module to do the job for you:
from functools import partial
import timeit

def measureTime(n, f, *args):
    # return average runtime for n number of times
    # use a for loop with number=1 to get all individual n runtimes
    return timeit.timeit(partial(f, *args), number=n)

# running within the module
measureTime(100, linear_search, list, random.choice(range(0, 50)))

# if running interactively outside the module, use the following; let's say your module name is mymodule
mymodule.measureTime(100, mymodule.linear_search, mymodule.list, mymodule.random.choice(range(0, 50)))
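As a minimal sketch of that comment (my own; linear_search copied from the question), collecting each individual runtime with number=1:

import random
import timeit
from functools import partial

def linear_search(obj, item):
    for i in range(len(obj)):
        if obj[i] == item:
            return i
    return -1

def measure_each(n, f, *args):
    # one timing per call, like the original measureTime intended
    return [timeit.timeit(partial(f, *args), number=1) for _ in range(n)]

data = list(range(50))
times = measure_each(100, linear_search, data, random.choice(range(0, 50)))
print(min(times), sum(times) / len(times))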
Take a look at the following example, don't know exactly what you are trying to achieve so I guessed it ;)
import random
import time

def measureTime(method, n, *args):
    start = time.time()
    for _ in xrange(n):
        method(*args)
    end = time.time()
    return (end - start) / n

def linear_search(lst, item):
    for i, o in enumerate(lst):
        if o == item:
            return i
    return -1

lst = [random.randint(0, 10**6) for _ in xrange(10**6)]
repetitions = 100
for _ in xrange(10):
    item = random.randint(0, 10**6)
    print 'average runtime =',
    print measureTime(linear_search, repetitions, lst, item) * 1000, 'ms'

Python not in dict condition sentence performance

Does anybody know which is better to use in terms of speed and resources? A link to some trusted sources would be much appreciated.
if key not in dictionary.keys():
or
if not dictionary.get(key):
Firstly, you'd do
if key not in dictionary:
since dicts are iterated over by keys.
Secondly, the two statements are not equivalent - the second condition would be true if the corresponding value is falsy (0, "", [] etc.), not only if the key doesn't exist.
Lastly, the first method is definitely faster and more pythonic. Function/method calls are expensive. If you're unsure, timeit.
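A minimal sketch (my own, Python 3) of that timeit suggestion, which also shows the falsy-value pitfall:

import timeit

d = {'present': 0}                # the value is falsy on purpose

print('present' not in d)         # False: the key exists
print(not d.get('present'))       # True: misleading, because the value 0 is falsy

print(timeit.timeit("'missing' not in d", globals=globals()))
print(timeit.timeit("not d.get('missing')", globals=globals()))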
In my experience, using in is faster than using get, although the speed of get can be improved by caching the get method so it doesn't have to be looked up each time. Here are some timeit tests:
''' in vs get speed test

    Comparing the speed of cache retrieval / update using `get` vs using `in`
    http://stackoverflow.com/a/35451912/4014959

    Written by PM 2Ring 2015.12.01
    Updated for Python 3 2017.08.08
'''

from __future__ import print_function
from timeit import Timer
from random import randint
import dis

cache = {}

def get_cache(x):
    ''' retrieve / update cache using `get` '''
    res = cache.get(x)
    if res is None:
        res = cache[x] = x
    return res

def get_cache_defarg(x, get=cache.get):
    ''' retrieve / update cache using defarg `get` '''
    res = get(x)
    if res is None:
        res = cache[x] = x
    return res

def in_cache(x):
    ''' retrieve / update cache using `in` '''
    if x in cache:
        return cache[x]
    else:
        res = cache[x] = x
        return res

# slow to fast.
funcs = (
    get_cache,
    get_cache_defarg,
    in_cache,
)

def show_bytecode():
    for func in funcs:
        fname = func.__name__
        print('\n%s' % fname)
        dis.dis(func)

def time_test(reps, loops):
    ''' Print timing stats for all the functions '''
    for func in funcs:
        fname = func.__name__
        print('\n%s: %s' % (fname, func.__doc__))
        setup = 'from __main__ import data, ' + fname
        cmd = 'for v in data: %s(v)' % (fname,)
        times = []
        t = Timer(cmd, setup)
        for i in range(reps):
            r = 0
            for j in range(loops):
                r += t.timeit(1)
                cache.clear()
            times.append(r)
        times.sort()
        print(times)

datasize = 1024
maxdata = 32
data = [randint(1, maxdata) for i in range(datasize)]

#show_bytecode()
time_test(3, 500)
typical output on my 2Ghz machine running Python 2.6.6:
get_cache: retrieve / update cache using `get`
[0.65624237060546875, 0.68499755859375, 0.76354193687438965]
get_cache_defarg: retrieve / update cache using defarg `get`
[0.54204297065734863, 0.55032730102539062, 0.56702113151550293]
in_cache: retrieve / update cache using `in`
[0.48754477500915527, 0.49125504493713379, 0.50087881088256836]
TLDR: Use if key not in dictionary. This is idiomatic, robust and fast.
There are four versions of relevance to this question: the 2 posed in the question, and the optimal variant of them:
key not in dictionary.keys() # inA
key not in dictionary # inB
not dictionary.get(key) # getA
sentinel = object()
dictionary.get(key, sentinel) is not sentinel # getB
Both A variants have shortcomings that mean you should not use them. inA needlessly creates a dict view on the keys - this adds an indirection step. getA looks at the truth of the value - this leads to incorrect results for values such as '' or 0.
As for using inB over getB: both do the same thing, namely looking at whether there is a value for key. However, getB also returns that value or default and has to compare it against the sentinel. Consequently, using get is considerably slower:
$ PREPARE="
> import random
> data = {a: True for a in range(0, 512, 2)}
> sentinel=object()"
$ python3 -m perf timeit -s "$PREPARE" '27 in data'
.....................
Mean +- std dev: 33.9 ns +- 0.8 ns
$ python3 -m perf timeit -s "$PREPARE" 'data.get(27, sentinel) is not sentinel'
.....................
Mean +- std dev: 105 ns +- 5 ns
Note that pypy3 has practically the same performance for both variants once the JIT has warmed up.
Ok, I've tested it on python 3.4.3 and all three ways give the same result around 0.00001 second.
import random
a = {}
for i in range(0, 1000000):
    a[str(random.random())] = random.random()

import time
t1 = time.time(); 1 in a.keys(); t2 = time.time(); print("Time=%s" % (t2 - t1))
t1 = time.time(); 1 in a; t2 = time.time(); print("Time=%s" % (t2 - t1))
t1 = time.time(); not a.get(1); t2 = time.time(); print("Time=%s" % (t2 - t1))

List Comparison Algorithm: How can it be made better?

Running on Python 3.3
I am attempting to create an efficient algorithm to pull all of the similar elements between two lists. The problem is twofold. First, I cannot seem to find any algorithms online. Second, there should be a more efficient way.
By 'similar elements', I mean two elements that are equal in value (be it string, int, whatever).
Currently, I am taking a greedy approach by:
Sorting the lists that are being compared,
Comparing each element in the shorter list to each element in the larger list,
Since the largeList and smallList are sorted we can save the last index that was visited,
Continue from the previous index (largeIndex).
Currently, the run-time seems to average O(n log(n)). This can be seen by running the test cases listed after this block of code.
Right now, my code looks as such:
def compare(small, large, largeStart, largeEnd):
    for i in range(largeStart, largeEnd):
        if small == large[i]:
            return [1, i]
        if small < large[i]:
            if i != 0:
                return [0, i-1]
            else:
                return [0, i]
    return [0, largeStart]

def determineLongerList(aList, bList):
    if len(aList) > len(bList):
        return (aList, bList)
    elif len(aList) < len(bList):
        return (bList, aList)
    else:
        return (aList, bList)

def compareElementsInLists(aList, bList):
    import time
    startTime = time.time()
    holder = determineLongerList(aList, bList)
    sameItems = []
    iterations = 0
    ##########################################
    smallList = sorted(holder[1])
    smallLength = len(smallList)
    smallIndex = 0
    largeList = sorted(holder[0])
    largeLength = len(largeList)
    largeIndex = 0
    while (smallIndex < smallLength):
        boolean = compare(smallList[smallIndex], largeList, largeIndex, largeLength)
        if boolean[0] == 1:
            # `compare` returns 1 as True
            sameItems.append(smallList[smallIndex])
            oldIndex = largeIndex
            largeIndex = boolean[1]
        else:
            # else no match and possible new index
            oldIndex = largeIndex
            largeIndex = boolean[1]
        smallIndex += 1
        iterations = largeIndex - oldIndex + iterations + 1
    print('RAN {it} OUT OF {mathz} POSSIBLE'.format(it=iterations, mathz=smallLength*largeLength))
    print('RATIO:\t\t' + str(iterations/(smallLength*largeLength)) + '\n')
    return sameItems
And here are some test cases:
def testLargest():
    import time
    from random import randint
    print('\n\n******************************************\n')
    start_time = time.time()
    lis = []
    for i in range(0, 1000000):
        ran = randint(0, 1000000)
        lis.append(ran)
    lis2 = []
    for i in range(0, 1000000):
        ran = randint(0, 1000000)
        lis2.append(ran)
    timeTaken = time.time() - start_time
    print('CREATING LISTS TOOK:\t\t' + str(timeTaken))
    print('\n******************************************')
    start_time = time.time()
    c = compareElementsInLists(lis, lis2)
    timeTaken = time.time() - start_time
    print('COMPARING LISTS TOOK:\t\t' + str(timeTaken))
    print('NUMBER OF SAME ITEMS:\t\t' + str(len(c)))
    print('\n******************************************')

#testLargest()

'''
One rendition of testLargest:

******************************************
CREATING LISTS TOOK:        21.009342908859253
******************************************
RAN 999998 OUT OF 1000000000000 POSSIBLE
RATIO:      9.99998e-07
COMPARING LISTS TOOK:       13.99990701675415
NUMBER OF SAME ITEMS:       632328
******************************************
'''

def testLarge():
    import time
    from random import randint
    print('\n\n******************************************\n')
    start_time = time.time()
    lis = []
    for i in range(0, 1000000):
        ran = randint(0, 100)
        lis.append(ran)
    lis2 = []
    for i in range(0, 1000000):
        ran = randint(0, 100)
        lis2.append(ran)
    timeTaken = time.time() - start_time
    print('CREATING LISTS TOOK:\t\t' + str(timeTaken))
    print('\n******************************************')
    start_time = time.time()
    c = compareElementsInLists(lis, lis2)
    timeTaken = time.time() - start_time
    print('COMPARING LISTS TOOK:\t\t' + str(timeTaken))
    print('NUMBER OF SAME ITEMS:\t\t' + str(len(c)))
    print('\n******************************************')

testLarge()
If you are just searching for all elements which are in both lists, you should use data types meant to handle such tasks. In this case, sets or bags would be appropriate. These are internally represented by hashing mechanisms which are even more efficient than searching in sorted lists.
(collections.Counter represents a suitable bag.)
If you do not care about duplicate elements, then sets would be fine.
a = set(listA)
print a.intersection(listB)
This will print all elements which are in listA and in listB. (Without doubled output for doubled input elements.)
import collections
a = collections.Counter(listA)
b = collections.Counter(listB)
print a & b
This will print how many elements are how often in both lists.
I didn't make any measuring but I'm pretty sure these solutions are way faster than your self-made attempts.
To convert a counter into a list of all represented elements again, you can use list(c.elements()).
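A minimal sketch of the Counter approach (my own illustration), keeping duplicates:

import collections

listA = [1, 2, 2, 3, 3, 3]
listB = [2, 2, 2, 3, 5]

common = collections.Counter(listA) & collections.Counter(listB)   # multiset intersection
print(list(common.elements()))   # [2, 2, 3]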
Using IPython's %%timeit magic to compare: the OP's function doesn't compare favourably with just a standard set() intersection.
Setup:
import random
alist = [random.randint(0, 100000) for _ in range(1000)]
blist = [random.randint(0, 100000) for _ in range(1000)]
Compare Elements:
%%timeit -n 1000
compareElementsInLists(alist, blist)
1000 loops, best of 3: 1.9 ms per loop
Vs Set Intersection
%%timeit -n 1000
set(alist) & set(blist)
1000 loops, best of 3: 104 µs per loop
Just to make sure we get the same results:
>>> compareElementsInLists(alist, blist)
[8282, 29521, 43042, 47193, 48582, 74173, 96216, 98791]
>>> set(alist) & set(blist)
{8282, 29521, 43042, 47193, 48582, 74173, 96216, 98791}

Fast way to remove a few items from a list/queue

This is a follow up to a similar question which asked the best way to write
for item in somelist:
    if determine(item):
        code_to_remove_item
and it seems the consensus was on something like
somelist[:] = [x for x in somelist if not determine(x)]
However, I think if you are only removing a few items, most of the items are being copied into the same object, and perhaps that is slow. In an answer to another related question, someone suggests:
for item in reversed(somelist):
    if determine(item):
        somelist.remove(item)
However, here the list.remove will search for the item, which is O(N) in the length of the list. Maybe we are limited in that the list is represented as an array rather than a linked list, so removing items will need to move everything after it. However, it is suggested here that collections.deque is represented as a doubly linked list. It should then be possible to remove in O(1) while iterating. How would we actually accomplish this?
Update:
I did some time testing as well, with the following code:
import timeit

setup = """
import random
random.seed(1)
b = [(random.random(), random.random()) for i in xrange(1000)]
c = []
def tokeep(x):
    return (x[1] > .45) and (x[1] < .5)
"""

listcomp = """
c[:] = [x for x in b if tokeep(x)]
"""

filt = """
c = filter(tokeep, b)
"""

print "list comp = ", timeit.timeit(listcomp, setup, number=10000)
print "filtering = ", timeit.timeit(filt, setup, number=10000)
and got:
list comp = 4.01255393028
filtering = 3.59962391853
The list comprehension is the asymptotically optimal solution:
somelist = [x for x in somelist if not determine(x)]
It only makes one pass over the list, so runs in O(n) time. Since you need to call determine() on each object, any algorithm will require at least O(n) operations. The list comprehension does have to do some copying, but it's only copying references to the objects not copying the objects themselves.
Removing items from a list in Python is O(n), so anything with a remove, pop, or del inside the loop will be O(n**2).
Also, in CPython list comprehensions are faster than for loops.
If you need to remove an item in O(1) you can use a hash map (a dict in Python).
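A minimal sketch of that idea (my own; the item structure and determine() are illustrative): storing the items in a dict keyed by a unique id gives O(1) average-case deletion per item, at the cost of list ordering semantics.

items = {i: ('payload', i) for i in range(10)}   # hypothetical items keyed by id

def determine(item):
    return item[1] % 3 == 0          # remove every third item, for illustration

for key in [k for k, v in items.items() if determine(v)]:
    del items[key]                   # O(1) average-case removal per key

print(list(items.values()))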
Since list.remove is equivalent to del list[list.index(x)], you could do:
for idx, item in enumerate(somelist):
    if determine(item):
        del somelist[idx]
But: you should not modify the list while iterating over it. It will bite you, sooner or later. Use filter or list comprehension first, and optimise later.
A deque is optimized for head and tail removal, not for arbitrary removal in the middle. The removal itself is fast, but you still have to traverse the list to the removal point. If you're iterating through the entire length, then the only difference between filtering a deque and filtering a list (using filter or a comprehension) is the overhead of copying, which at worst is a constant multiple; it's still an O(n) operation. Also, note that the objects in the list aren't being copied -- just the references to them. So it's not that much overhead.
It's possible that you could avoid copying like so, but I have no particular reason to believe this is faster than a straightforward list comprehension -- it's probably not:
write_i = 0
for read_i in range(len(L)):
    L[write_i] = L[read_i]
    if L[read_i] not in ['a', 'c']:
        write_i += 1
del L[write_i:]
I took a stab at this. My solution is slower, but requires less memory overhead (i.e. doesn't create a new array). It might even be faster in some circumstances!
This code has been edited since its first posting
I had problems with timeit, I might be doing this wrong.
import timeit

setup = """
import random
random.seed(1)
global b
setup_b = [(random.random(), random.random()) for i in xrange(1000)]
c = []
def tokeep(x):
    return (x[1] > .45) and (x[1] < .5)

# define and call to turn into psyco bytecode (if using psyco)
b = setup_b[:]
def listcomp():
    c[:] = [x for x in b if tokeep(x)]
listcomp()

b = setup_b[:]
def filt():
    c = filter(tokeep, b)
filt()

b = setup_b[:]
def forfilt():
    marked = (i for i, x in enumerate(b) if tokeep(x))
    shift = 0
    for n in marked:
        del b[n - shift]
        shift += 1
forfilt()

b = setup_b[:]
def forfiltCheating():
    marked = (i for i, x in enumerate(b) if (x[1] > .45) and (x[1] < .5))
    shift = 0
    for n in marked:
        del b[n - shift]
        shift += 1
forfiltCheating()
"""

listcomp = """
b = setup_b[:]
listcomp()
"""

filt = """
b = setup_b[:]
filt()
"""

forfilt = """
b = setup_b[:]
forfilt()
"""

forfiltCheating = '''
b = setup_b[:]
forfiltCheating()
'''

psycosetup = '''
import psyco
psyco.full()
'''

print "list comp = ", timeit.timeit(listcomp, setup, number=10000)
print "filtering = ", timeit.timeit(filt, setup, number=10000)
print 'forfilter = ', timeit.timeit(forfilt, setup, number=10000)
print 'forfiltCheating = ', timeit.timeit(forfiltCheating, setup, number=10000)

print '\nnow with psyco \n'
print "list comp = ", timeit.timeit(listcomp, psycosetup + setup, number=10000)
print "filtering = ", timeit.timeit(filt, psycosetup + setup, number=10000)
print 'forfilter = ', timeit.timeit(forfilt, psycosetup + setup, number=10000)
print 'forfiltCheating = ', timeit.timeit(forfiltCheating, psycosetup + setup, number=10000)
And here are the results
list comp = 6.56407690048
filtering = 5.64738512039
forfilter = 7.31555104256
forfiltCheating = 4.8994679451
now with psyco
list comp = 8.0485959053
filtering = 7.79016900063
forfilter = 9.00477004051
forfiltCheating = 4.90830993652
I must be doing something wrong with psyco, because it is actually running slower.
Elements are not copied by a list comprehension.
It took me a while to figure this out. See the example code below to experiment yourself with different approaches.
code
You can specify how long a list element takes to copy and how long it takes to evaluate. As it turned out, the time to copy is irrelevant for the list comprehension.
import time
import timeit
import numpy as np

def ObjectFactory(time_eval, time_copy):
    """
    Creates a class

    Parameters
    ----------
    time_eval : float
        time to evaluate (True or False, i.e. keep in list or not) an object
    time_copy : float
        time to (shallow-) copy an object. Used by list comprehension.

    Returns
    -------
    New class with defined copy-evaluate performance
    """
    class Object:
        def __init__(self, id_, keep):
            self.id_ = id_
            self._keep = keep

        def __repr__(self):
            return f"Object({self.id_}, {self.keep})"

        @property
        def keep(self):
            time.sleep(time_eval)
            return self._keep

        def __copy__(self):  # list comprehension does not copy the object
            time.sleep(time_copy)
            return self.__class__(self.id_, self._keep)

    return Object

def remove_items_from_list_list_comprehension(lst):
    return [el for el in lst if el.keep]

def remove_items_from_list_new_list(lst):
    new_list = []
    for el in lst:
        if el.keep:
            new_list += [el]
    return new_list

def remove_items_from_list_new_list_by_ind(lst):
    new_list_inds = []
    for ee in range(len(lst)):
        if lst[ee].keep:
            new_list_inds += [ee]
    return [lst[ee] for ee in new_list_inds]

def remove_items_from_list_del_elements(lst):
    """WARNING: Modifies lst"""
    new_list_inds = []
    for ee in range(len(lst)):
        if lst[ee].keep:
            new_list_inds += [ee]
    for ind in new_list_inds[::-1]:
        if not lst[ind].keep:
            del lst[ind]

if __name__ == "__main__":
    ClassSlowCopy = ObjectFactory(time_eval=0, time_copy=0.1)
    ClassSlowEval = ObjectFactory(time_eval=1e-8, time_copy=0)

    keep_ratio = .8
    n_runs_timeit = int(1e2)
    n_elements_list = int(1e2)

    lsts_to_tests = dict(
        list_slow_copy_remove_many = [ClassSlowCopy(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
        list_slow_copy_keep_many = [ClassSlowCopy(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
        list_slow_eval_remove_many = [ClassSlowEval(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
        list_slow_eval_keep_many = [ClassSlowEval(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
    )

    for lbl, lst in lsts_to_tests.items():
        print()
        for fct in [
            remove_items_from_list_list_comprehension,
            remove_items_from_list_new_list,
            remove_items_from_list_new_list_by_ind,
            remove_items_from_list_del_elements,
        ]:
            lst_loc = lst.copy()
            t = timeit.timeit(lambda: fct(lst_loc), number=n_runs_timeit)
            print(f"{fct.__name__}, {lbl}: {t=}")
output
remove_items_from_list_list_comprehension, list_slow_copy_remove_many: t=0.0064229519994114526
remove_items_from_list_new_list, list_slow_copy_remove_many: t=0.006507338999654166
remove_items_from_list_new_list_by_ind, list_slow_copy_remove_many: t=0.006562008995388169
remove_items_from_list_del_elements, list_slow_copy_remove_many: t=0.0076057760015828535
remove_items_from_list_list_comprehension, list_slow_copy_keep_many: t=0.006243691001145635
remove_items_from_list_new_list, list_slow_copy_keep_many: t=0.007145451003452763
remove_items_from_list_new_list_by_ind, list_slow_copy_keep_many: t=0.007032064997474663
remove_items_from_list_del_elements, list_slow_copy_keep_many: t=0.007690364996960852
remove_items_from_list_list_comprehension, list_slow_eval_remove_many: t=1.2495998149970546
remove_items_from_list_new_list, list_slow_eval_remove_many: t=1.1657221479981672
remove_items_from_list_new_list_by_ind, list_slow_eval_remove_many: t=1.2621939050004585
remove_items_from_list_del_elements, list_slow_eval_remove_many: t=1.4632593330024974
remove_items_from_list_list_comprehension, list_slow_eval_keep_many: t=1.1344162709938246
remove_items_from_list_new_list, list_slow_eval_keep_many: t=1.1323430630000075
remove_items_from_list_new_list_by_ind, list_slow_eval_keep_many: t=1.1354237199993804
remove_items_from_list_del_elements, list_slow_eval_keep_many: t=1.3084568729973398
import collections
list1 = collections.deque(list1)
for i in list2:
    try:
        list1.remove(i)
    except:
        pass
Instead of checking whether the element is there, this uses try/except. I guess this is faster.
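For comparison, a minimal sketch (my own; names and sizes are illustrative) timing that deque/try-except approach against a single-pass comprehension with a set of items to drop:

import collections
import random
import timeit

list1 = [random.randint(0, 10000) for _ in range(10000)]
list2 = random.sample(list1, 20)                 # a few items to remove

def via_deque():
    d = collections.deque(list1)
    for i in list2:
        try:
            d.remove(i)                          # each remove is an O(n) search
        except ValueError:
            pass
    return d

def via_comprehension():
    drop = set(list2)
    return [x for x in list1 if x not in drop]   # single O(n) pass (drops every occurrence)

print(timeit.timeit(via_deque, number=100))
print(timeit.timeit(via_comprehension, number=100))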
