I am trying to write a function called find_it(seq) that, given a list of numbers, returns the number that appears an odd number of times.
I have tried rearranging the return and the for loop, and I have tried it without the else clause.
Can someone point out how to format it?
Thanks.
def find_it(seq):
#return i for i in seq if seq.count(i) % 2 == 1 else 0
for i in seq: return i if seq.count(i) % 2 == 1 else: pass
#this is my solution without the one line and without using count()
def find_it(seq):
dic = {}
for i in seq:
if i not in dic:
dic.update({i:1})
else:
dic[i] += 1
print(dic)
for item,num in dic.items():
if num % 2 == 1:
return item
If you insist on a one-liner loop, I suggest you use a generator with next(); this will make the code more readable:
def find_it(seq):
return next((i for i in seq if seq.count(i) % 2 == 1), None)
However, the more efficient way is a simple loop:
def find_it(seq):
for i in seq:
if seq.count(i) % 2 == 1:
return i
def find_it(seq):
return set([el for el in seq if seq.count(el) % 2 == 1])
> print(find_it([1, 2, 3, 1, 1, 2, 3]))
{1}
This snippet returns a set of elements present an odd number of times in a list.
It's not as efficient as it could be, because elements that appear multiple times are counted repeatedly: count(2) returns how many 2s are in the list, but the next time the loop reaches a 2 it calculates the count of 2 all over again, even though it has already done so.
This can be rectified by removing all occurrences of an element from the list after it has been checked, or by ignoring already-checked elements (see the sketch below). I was unable to find a way to do this in one line as you requested.
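For illustration, a rough (not one-line) sketch of the "ignore checked elements" idea; the function name is my own:

def find_odd_ones(seq):
    seen = set()   # elements we have already counted
    odd = set()
    for el in seq:
        if el in seen:
            continue
        seen.add(el)
        if seq.count(el) % 2 == 1:
            odd.add(el)
    return odd

print(find_odd_ones([1, 2, 3, 1, 1, 2, 3]))  # {1}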
I digress only because OP seems intent on efficiency.
For this problem, the technique used can be influenced by the data being processed. This is best explained by example. Here are six different ways to achieve the same objective.
from collections import Counter
from timeit import timeit
# even number of 1s, odd number of 2s
list_ = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2]
def find_it_1(seq):
for k, v in Counter(seq).items():
if v % 2:
return k
def find_it_2(seq):
for i in seq:
if seq.count(i) % 2 :
return i
def find_it_3(seq):
s = set()
for e in seq:
if e not in s:
if seq.count(e) % 2:
return e
s.add(e)
def find_it_4(seq):
return next((i for i in seq if seq.count(i) % 2), None)
def find_it_5(seq):
for e in set(seq):
if seq.count(e) % 2:
return e
def find_it_6(seq):
d = {}
for e in seq:
d[e] = d.get(e, 0) + 1
for k, v in d.items():
if v % 2:
return k
for func in find_it_1, find_it_2, find_it_3, find_it_4, find_it_5, find_it_6:
print(func.__name__, timeit(lambda: func(list_)))
Output:
find_it_1 1.627880711999751
find_it_2 2.23142556699986
find_it_3 0.9605982989996846
find_it_4 2.4646536830000514
find_it_5 0.6783656980001069
find_it_6 1.9190425920000962
Now, let's change the data as follows:
list_ = [2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
Note that there are 3 occurrences of 2 and that they're at the start of the list. This results in:
find_it_1 1.574513012999887
find_it_2 0.3627374699999564
find_it_3 0.4003442379998887
find_it_4 0.5936855530007961
find_it_5 0.674294768999971
find_it_6 1.8698847380001098
Quod Erat Demonstrandum
Suppose I have a my_huge_list_of_lists with 2,000,000 lists in it, each about 50 elements long.
I want to shorten the 2,000,000-entry my_huge_list_of_lists by discarding the sublists that do not contain two particular elements in sequence.
So far I have:
# https://stackoverflow.com/questions/3313590/check-for-presence-of-a-sliced-list-in-python
def check_if_list_is_sublist(lst, sublst):
#checks if a list appears in order in another larger list.
n = len(sublst)
return any((sublst == lst[i:i + n]) for i in xrange(len(lst) - n + 1))
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
if not check_if_list_is_sublist(x, [a,b])]
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
if not check_if_list_is_sublist(x, [b,a])]
The consecutiveness of the search term [a,b] or [b,a] is important, so I can't use set.issubset().
I find this slow and I'd like to speed it up. I've considered a few options, like using an 'early exit' and-statement:
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
if (a in x and not check_if_list_is_sublist(x, [a,b]))]
and going through the for loop fewer times by combining both searches with an or-statement:
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
if not (check_if_list_is_sublist(x, [a,b])
or check_if_list_is_sublist(x, [b,a]))]
and also working on speeding up the function itself (still a work in progress):
# https://stackoverflow.com/questions/48232080/the-fastest-way-to-check-if-the-sub-list-exists-on-the-large-list
def check_if_list_is_sublist(lst, sublst):
    """Checks if a list appears in order in another larger list."""
set_of_sublists = {tuple(sublst) for sublist in lst}
I've also done some searching on Stack Overflow, but I can't think of a better way, because check_if_list_is_sublist() gets called len(my_huge_list_of_lists) * 2 times.
Edit: added some example data as requested:
from random import randint
from string import ascii_lowercase
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for x in range(50)] for y in range(2000000)]
my_neighbor_search_fwd = [i,c]
my_neighbor_search_rev = my_neighbor_search_fwd[::-1]  # note: list.reverse() reverses in place and returns None
Unpack the items in the n-sized subsequence into n variables. Then write a list comprehension to filter the list, doing a check for a, b or b, a in the sub-list, e.g.:
a, b = sublst
def checklst(lst, a, b):
l = len(lst)
start = 0
while True:
try:
a_index = lst.index(a, start)
except ValueError:
return False
try:
return a_index > -1 and lst[a_index+1] == b
except IndexError:
try:
return a_index > -1 and lst[a_index-1] == b
except IndexError:
start = a_index + 1
if start == l:
return False
continue # keep looking at the next a
%timeit found = [l for l in lst if checklst(l, a, b)]
1.88 s ± 31.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit found = [x for x in lst if (a in x and not check_if_list_is_sublist(x, [a,b]))]
22.1 s ± 1.67 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, I can't think of any clever algorithmic tricks to really reduce the amount of work here. However, you are doing a LOT of allocations in your code, and iterating too far. Just moving some declarations out of the function got me:
sublst = [a, b]
l = len(sublst)
indices = range(len(sublst))
def check_if_list_is_sublist(lst):
for i in range(len(lst) - (l -1)):
if lst[i] == sublst[0] and lst[i+1] == sublst[1]:
return True
if lst[i] == sublst[1] and lst[i + 1] == sublst[0]:
return True
return False
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
if not check_if_list_is_sublist(x)]
Which reduced the run-time of your sample code above by about 50%. With a list this size, spawning some more processes and dividing the work would probably see a performance increase as well. Can't think of any way to really reduce the amount of comparisons though...
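To illustrate the multiprocessing suggestion, here is a minimal sketch; the helper names, the chunk size, and the fixed pair a, b are my own assumptions, not code from the question:

from multiprocessing import Pool

a, b = 'a', 'b'  # assumed search pair

def keep(lst):
    # True if lst does NOT contain a,b or b,a consecutively
    return not any((lst[i], lst[i + 1]) in ((a, b), (b, a))
                   for i in range(len(lst) - 1))

def parallel_filter(list_of_lists, processes=4):
    # a large chunksize keeps inter-process overhead low for millions of small items
    with Pool(processes) as pool:
        mask = pool.map(keep, list_of_lists, chunksize=10000)
    return [x for x, m in zip(list_of_lists, mask) if m]

if __name__ == '__main__':
    sample = [['a', 'b', 'c'], ['c', 'a', 'c'], ['c', 'b', 'a']]
    print(parallel_filter(sample, processes=2))  # [['c', 'a', 'c']]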
This isn't what you'd call an "answer" per se; rather, it's a benchmarking framework that should help you determine the quickest way to accomplish what you want, because it allows relatively easy modification as well as the addition of different approaches.
I've put the answers currently posted into it, as well as the results of running it with them.
Caveats: Note that I haven't verified that all the tested answers in it are "correct" in the sense that they actually do what you want, nor how much memory they'll consume in the process—which might be another consideration.
Currently it appears that @Oluwafemi Sule's answer is the fastest, by an order of magnitude (about 10x) over the closest competitor.
from __future__ import print_function
from collections import namedtuple
import sys
from textwrap import dedent
import timeit
import traceback
N = 10 # Number of executions of each "algorithm".
R = 3 # Number of repetitions of those N executions.
from random import randint, randrange, seed
from string import ascii_lowercase
a, b = 'a', 'b'
NUM_SUBLISTS = 1000
SUBLIST_LEN = 50
PERCENTAGE = 50 # Percentage of sublist that should get removed.
seed(42) # Initialize random number so the results are reproducible.
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for __ in range(SUBLIST_LEN)]
for __ in range(NUM_SUBLISTS)]
# Put the target sequence in percentage of the sublists so they'll be removed.
for __ in range(NUM_SUBLISTS*PERCENTAGE // 100):
list_index = randrange(NUM_SUBLISTS)
sublist_index = randrange(SUBLIST_LEN)
my_huge_list_of_lists[list_index][sublist_index:sublist_index+2] = [a, b]
# Common setup for all testcases (executed before any algorithm specific setup).
COMMON_SETUP = dedent("""
from __main__ import a, b, my_huge_list_of_lists, NUM_SUBLISTS, SUBLIST_LEN, PERCENTAGE
""")
class TestCase(namedtuple('CodeFragments', ['setup', 'test'])):
""" A test case is composed of separate setup and test code fragments. """
def __new__(cls, setup, test):
""" Dedent code fragment in each string argument. """
return tuple.__new__(cls, (dedent(setup), dedent(test)))
testcases = {
"OP (Nas Banov)": TestCase("""
# https://stackoverflow.com/questions/3313590/check-for-presence-of-a-sliced-list-in-python
def check_if_list_is_sublist(lst, sublst):
''' Checks if a list appears in order in another larger list. '''
n = len(sublst)
return any((sublst == lst[i:i+n]) for i in range(len(lst) - n + 1))
""", """
shortened = [x for x in my_huge_list_of_lists
if not check_if_list_is_sublist(x, [a, b])]
"""
),
"Sphinx Solution 1 (hash)": TestCase("""
# https://stackoverflow.com/a/49518843/355230
# Solution 1: By using built-in hash function.
def prepare1(huge_list, interval=1): # Use built-in hash function.
hash_db = {}
for index in range(len(huge_list) - interval + 1):
hash_sub = hash(str(huge_list[index:index+interval]))
if hash_sub in hash_db:
hash_db[hash_sub].append(huge_list[index:index+interval])
else:
hash_db[hash_sub] = [huge_list[index:index+interval]]
return hash_db
def check_sublist1(hash_db, sublst): # Use built-in hash function.
hash_sub = hash(str(sublst))
if hash_sub in hash_db:
return any([sublst == item for item in hash_db[hash_sub]])
return False
""", """
hash_db = prepare1(my_huge_list_of_lists, interval=2)
shortened = [x for x in my_huge_list_of_lists
if check_sublist1(hash_db, x)]
"""
),
"Sphinx Solution 2 (str)": TestCase("""
# https://stackoverflow.com/a/49518843/355230
#Solution 2: By using str() as hash function
def prepare2(huge_list, interval=1): # Use str() as hash function.
return {str(huge_list[index:index+interval]):huge_list[index:index+interval]
for index in range(len(huge_list) - interval + 1)}
def check_sublist2(hash_db, sublst): #use str() as hash function
hash_sub = str(sublst)
if hash_sub in hash_db:
return sublst == hash_db[hash_sub]
return False
""", """
hash_db = prepare2(my_huge_list_of_lists, interval=2)
shortened = [x for x in my_huge_list_of_lists
if check_sublist2(hash_db, x)]
"""
),
"Paul Becotte": TestCase("""
# https://stackoverflow.com/a/49504792/355230
sublst = [a, b]
l = len(sublst)
indices = range(len(sublst))
def check_if_list_is_sublist(lst):
for i in range(len(lst) - (l -1)):
if lst[i] == sublst[0] and lst[i+1] == sublst[1]:
return True
if lst[i] == sublst[1] and lst[i + 1] == sublst[0]:
return True
return False
""", """
shortened = [x for x in my_huge_list_of_lists
if not check_if_list_is_sublist(x)]
"""
),
"Oluwafemi Sule": TestCase("""
# https://stackoverflow.com/a/49504440/355230
def checklst(lst, a, b):
try:
a_index = lst.index(a)
except ValueError:
return False
try:
return a_index > -1 and lst[a_index+1] == b
except IndexError:
try:
return a_index > -1 and lst[a_index-1] == b
except IndexError:
return False
""", """
shortened = [x for x in my_huge_list_of_lists
if not checklst(x, a, b)]
"""
),
}
# Collect timing results of executing each testcase multiple times.
try:
results = [
(label,
min(timeit.repeat(testcases[label].test,
setup=COMMON_SETUP + testcases[label].setup,
repeat=R, number=N)),
) for label in testcases
]
except Exception:
traceback.print_exc(file=sys.stdout) # direct output to stdout
sys.exit(1)
# Display results.
print('Results for {:,d} sublists of length {:,d} with {}% of them matching.'
      .format(NUM_SUBLISTS, SUBLIST_LEN, PERCENTAGE))
major, minor, micro = sys.version_info[:3]
print('Fastest to slowest execution speeds using Python {}.{}.{}\n'
'({:,d} executions, best of {:d} repetitions)'.format(major, minor, micro, N, R))
print()
longest = max(len(result[0]) for result in results) # length of longest label
ranked = sorted(results, key=lambda t: t[1]) # ascending sort by execution time
fastest = ranked[0][1]
for result in ranked:
print('{:>{width}} : {:9.6f} secs, rel speed {:5.2f}x, {:6.2f}% slower '
''.format(
result[0], result[1], round(result[1]/fastest, 2),
round((result[1]/fastest - 1) * 100, 2),
width=longest))
print()
Output:
Results for 1,000 sublists of length 50 with 50% of them matching.
Fastest to slowest execution speeds using Python 3.6.4
(10 executions, best of 3 repetitions)
Oluwafemi Sule : 0.006441 secs, rel speed 1.00x, 0.00% slower
Paul Becotte : 0.069462 secs, rel speed 10.78x, 978.49% slower
OP (Nas Banov) : 0.082758 secs, rel speed 12.85x, 1184.92% slower
Sphinx Solution 2 (str) : 0.152119 secs, rel speed 23.62x, 2261.84% slower
Sphinx Solution 1 (hash) : 0.154562 secs, rel speed 24.00x, 2299.77% slower
For finding matches in one large list, I believe hashing the elements and building an index is a good solution.
The benefits you will get:
build the index once, and save time on every future search (no need to loop over the whole list again and again for each search).
You can even build the index when the program launches, then release it when the program exits.
The code below uses two methods to get a hash value: hash() and str(); sometimes you may need a custom hash function for your specific scenario.
If you use str(), the code is simpler and you don't need to consider hash collisions, but it may blow up memory usage.
For hash(), I used a list to save all sub_lsts that share the same hash value, and you can use hash(sub_lst) % designed_length to control the hash table size (but that will increase the hash collision rate).
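A tiny sketch of that modulo variant (designed_length is a value you would tune; this is an illustration only, not the version benchmarked below):

def prepare_bucketed(huge_list, interval=2, designed_length=2**16):
    # like prepare1 below, but folds the hash into a fixed-size key space
    buckets = {}
    for index in range(len(huge_list) - interval + 1):
        sub = huge_list[index:index + interval]
        key = hash(str(sub)) % designed_length  # smaller key space, more collisions
        buckets.setdefault(key, []).append(sub)
    return buckets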
Output of the code below:
By Hash: 0.00023986603994852955
By str(): 0.00022884208565612796
By OP's: 0.3001317172469765
[Finished in 1.781s]
Test Codes:
from random import randint
from string import ascii_lowercase
import timeit
#Generate Test Data
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for x in range(50)] for y in range(10000)]
#print(my_huge_list_of_lists)
test_lst = [['a', 'b', 'c' ], ['a', 'b', 'c'] ]
#Solution 1: By using built-in hash function
def prepare1(huge_list, interval=1): #use built-in hash function
hash_db = {}
for index in range(len(huge_list) - interval + 1):
hash_sub = hash(str(huge_list[index:index+interval]))
if hash_sub in hash_db:
hash_db[hash_sub].append(huge_list[index:index+interval])
else:
hash_db[hash_sub] = [huge_list[index:index+interval]]
return hash_db
hash_db = prepare1(my_huge_list_of_lists, interval=2)
def check_sublist1(hash_db, sublst): #use built-in hash function
hash_sub = hash(str(sublst))
if hash_sub in hash_db:
return any([sublst == item for item in hash_db[hash_sub]])
return False
print('By Hash:', timeit.timeit("check_sublist1(hash_db, test_lst)", setup="from __main__ import check_sublist1, my_huge_list_of_lists, test_lst, hash_db ", number=100))
#Solution 2: By using str() as hash function
def prepare2(huge_list, interval=1): #use str() as hash function
return { str(huge_list[index:index+interval]):huge_list[index:index+interval] for index in range(len(huge_list) - interval + 1)}
hash_db = prepare2(my_huge_list_of_lists, interval=2)
def check_sublist2(hash_db, sublst): #use str() as hash function
hash_sub = str(sublst)
if hash_sub in hash_db:
return sublst == hash_db[hash_sub]
return False
print('By str():', timeit.timeit("check_sublist2(hash_db, test_lst)", setup="from __main__ import check_sublist2, my_huge_list_of_lists, test_lst, hash_db ", number=100))
#Solution 3: OP's current solution
def check_if_list_is_sublist(lst, sublst):
#checks if a list appears in order in another larger list.
n = len(sublst)
return any((sublst == lst[i:i + n]) for i in range(len(lst) - n + 1))
print('By OP\'s:', timeit.timeit("check_if_list_is_sublist(my_huge_list_of_lists, test_lst)", setup="from __main__ import check_if_list_is_sublist, my_huge_list_of_lists, test_lst ", number=100))
If you'd like to remove the matched elements from the list, that is doable, but the effect is that you may have to rebuild the index for the new list, unless the list were a linked list and the index stored a reference to each node. I just googled how to get a pointer to one element of a Python list, but couldn't find anything helpful. If someone knows how to do it, please don't hesitate to share your solution. Thanks.
Below is one sample (it generates a new list instead of modifying the original; sometimes we still need to filter something out of the original list):
from random import randint
from string import ascii_lowercase
import timeit
#Generate Test Data
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 1)] for x in range(2)] for y in range(100)]
#print(my_huge_list_of_lists)
test_lst = [[['a', 'b'], ['a', 'b'] ], [['b', 'a'], ['a', 'b']]]
#Solution 1: By using built-in hash function
def prepare(huge_list, interval=1): #use built-in hash function
hash_db = {}
for index in range(len(huge_list) - interval + 1):
hash_sub = hash(str(huge_list[index:index+interval]))
if hash_sub in hash_db:
hash_db[hash_sub].append({'beg':index, 'end':index+interval, 'data':huge_list[index:index+interval]})
else:
hash_db[hash_sub] = [{'beg':index, 'end':index+interval, 'data':huge_list[index:index+interval]}]
return hash_db
hash_db = prepare(my_huge_list_of_lists, interval=2)
def check_sublist(hash_db, sublst): #use built-in hash function
hash_sub = hash(str(sublst))
if hash_sub in hash_db:
return [ item for item in hash_db[hash_sub] if sublst == item['data'] ]
return []
def remove_if_match_sublist(target_list, hash_db, sublsts):
matches = []
for sublst in sublsts:
matches += check_sublist(hash_db, sublst)
#make sure delete elements from end to begin
sorted_match = sorted(matches, key=lambda item:item['beg'], reverse=True)
new_list = list(target_list)
for item in sorted_match:
del new_list[item['beg']:item['end']]
return new_list
print('Removed By Hash:', timeit.timeit("remove_if_match_sublist(my_huge_list_of_lists, hash_db, test_lst)", setup="from __main__ import check_sublist, my_huge_list_of_lists, test_lst, hash_db, remove_if_match_sublist ", number=1))
If one were to attempt to find the indexes of an item in a list, it could be done a couple of different ways. Here is what I know to be the fastest:
aList = [123, 'xyz', 'zara','xyz', 'abc'];
indices = [i for i, x in enumerate(aList) if x == "xyz"]
print(indices)
Another way, not Pythonic and slower:
count = 0
indices = []
aList = [123, 'xyz', 'zara','xyz', 'abc'];
for i in range(0, len(aList)):
if 'xyz' == aList[i]:
indices.append(i)
print(indices)
The first method is undoubtedly faster, but what if you wanted to go faster still; is there a way? Getting just the first index using:
aList = [123, 'xyz', 'zara','xyz', 'abc'];
print "Index for xyz : ", aList.index( 'xyz' )
is very fast but can't handle multiple indexes.
How might one go about speeding things up?
Use list.index(elem, start)! That uses a for loop in C (see its implementation list_index_impl function in the source of CPython's listobject.c).
Avoid looping through all the elements in Python; it is slower than in C.
def index_finder(lst, item):
"""A generator function, if you might not need all the indices"""
start = 0
while True:
try:
start = lst.index(item, start)
yield start
start += 1
except ValueError:
break
import array
def index_find_all(lst, item, results=None):
""" If you want all the indices.
Pass results=[] if you explicitly need a list,
or anything that can .append(..)
"""
if results is None:
length = len(lst)
results = (array.array('B') if length <= 2**8 else
array.array('H') if length <= 2**16 else
array.array('L') if length <= 2**32 else
array.array('Q'))
start = 0
while True:
try:
start = lst.index(item, start)
results.append(start)
start += 1
except ValueError:
return results
# Usage example
l = [1, 2, 3, 4, 5, 6, 7, 8] * 32
print(*index_finder(l, 1))
print(*index_find_all(l, 1))
def find(target, myList):
for i in range(len(myList)):
if myList[i] == target:
yield i
def find_with_list(myList, target):
inds = []
for i in range(len(myList)):
if myList[i] == target:
inds += i,
return inds
In [8]: x = range(50)*200
In [9]: %timeit [i for i,j in enumerate(x) if j == 3]
1000 loops, best of 3: 598 us per loop
In [10]: %timeit list(find(3,x))
1000 loops, best of 3: 607 us per loop
In [11]: %timeit find(3,x)
1000000 loops, best of 3: 375 ns per loop
In [55]: %timeit find_with_list(x,3)
1000 loops, best of 3: 618 us per loop
Assuming you want a list as your output:
All options seemed to exhibit similar time performance for my test, with the list comprehension being the fastest (barely).
If using a generator is acceptable, it's way faster than the other approaches. Though it doesn't account for actually iterating over the indices, nor does it store them, so the indices cannot be iterated over a second time.
Simply create a dictionary of item->index from the list of items using zip like so:
items_as_dict = dict(zip(list_of_items,range(0,len(list_of_items))))
index = items_as_dict[item]
To get the index of an item, you then just index into the dictionary.
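For example (note that with duplicate items the last index wins, unlike list.index which returns the first):

list_of_items = [123, 'xyz', 'zara', 'xyz', 'abc']
items_as_dict = dict(zip(list_of_items, range(len(list_of_items))))
print(items_as_dict['xyz'])  # 3 (the last occurrence, not the first)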
import numpy as np

aList = [123, 'xyz', 'zara', 'xyz', 'abc']
# The following approach works only on lists with unique values
# (note that np.unique also sorts the values and coerces them to a common type, strings here)
aList = list(np.unique(aList))
idx_dict = dict(enumerate(aList))
# get the inverse mapping of the above dictionary: swap keys and values
inv_dict = dict(zip(idx_dict.values(), idx_dict.keys()))
# to get the index of an item by value, use 'inv_dict'; to get a value by index, use 'idx_dict'
valueofItemAtIndex0 = idx_dict[0]          # value = '123'
indexofItemWithValue123 = inv_dict['123']  # index = 0
D=dict()
for i, item in enumerate(l):
if item not in D:
D[item] = [i]
else:
D[item].append(i)
Then simply call D[item] to get the indices that match. You'll give up initial calculation time but gain it during call time.
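For example, with a small sample list:

l = ['a', 'b', 'a', 'c', 'a']
D = dict()
for i, item in enumerate(l):
    if item not in D:
        D[item] = [i]
    else:
        D[item].append(i)
print(D['a'])          # [0, 2, 4]
print(D.get('z', []))  # [] for items that never appear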
I used another way to find the index of an element in a list in Python 3:
def index_of(elem, a):
a_e = enumerate(a)
a_f = list(filter(lambda x: x[1] == elem, a_e))
if a_f:
return a_f[0][0]
else:
return -1
Some tests:
a=[1,2,3,4,2]
index_of(2,a)
This function always returns the first occurrence of the element. If the element isn't in the list, it returns -1. For my goals, that solution worked well.
I'm having some speed issues regarding the conversion of lists to dictionaries, where the following operation takes up about 90% of the total running time:
def list2dict(list_):
return_dict = {}
for idx, word in enumerate(list_):
if word in return_dict:
raise ValueError("duplicate string found in list: %s" % (word))
return_dict[word] = idx
return return_dict
I'm having trouble seeing what exactly causes this. Are there any obvious bottlenecks in the code, or suggestions on how to speed it up?
Thanks.
EDIT:
Figured I'd put this up top since it's bigger -- turns out that a minor tweak to OP's code gives a pretty big bump in performance.
def list2dict(list_): # OLD
return_dict = {}
for idx, word in enumerate(list_):
if word in return_dict: # this compare is happening every iteration!
raise ValueError("duplicate string found in list: %s" % (word))
return_dict[word] = idx
return return_dict
def list2dictNEW(list_): #NEW HOTNESS
return_dict = {}
for idx, word in enumerate(list_):
return_dict[word] = idx # overwrite if you want to, because...
if len(return_dict) == len(list_): return return_dict
# if the lengths aren't the same, something got overwritten so we
# won't return. If they ARE the same, toss it back with only one
    # compare (rather than n compares in the original).
else: raise ValueError("There were duplicates in list {}".format(list_))
DEMO:
>>> timeit(lambda: list2dictNEW(TEST))
1.9117132451798682
>>> timeit(lambda: list2dict(TEST))
2.2543816669587216
# gains of a third of a second per million iterations!
# that's a 15.2% speed boost
No obvious answers, but you could try something like:
def list2dict(list_):
return_dict = dict()
for idx,word in enumerate(list_):
return_dict.setdefault(word,idx)
return return_dict
You could also build a set and do list.index since you say the lists are fairly small, but I'm GUESSING that would be slower rather than faster. This would need profiling to be sure (use timeit.timeit)
def list2dict(list_):
set_ = set(list_)
return {word:list_.index(word) for word in set_}
I took the liberty of running some profiles on a set of test data. Here are the results:
TEST = ['a','b','c','d','e','f','g','h','i','j'] # 10 items
def list2dictA(list_): # build set and index word
set_ = set(list_)
return {word:list_.index(word) for word in set_}
def list2dictB(list_): # setdefault over enumerate(list)
return_dict = dict()
for idx,word in enumerate(list_):
return_dict.setdefault(word,idx)
return return_dict
def list2dictC(list_): # dict comp over enumerate(list)
return_dict = {word:idx for idx,word in enumerate(list_)}
if len(return_dict) == len(list_):
return return_dict
else:
raise ValueError("Duplicate string found in list")
def list2dictD(list_): # Original example from Question
return_dict = {}
for idx, word in enumerate(list_):
if word in return_dict:
raise ValueError("duplicate string found in list: %s" % (word))
return_dict[word] = idx
return return_dict
>>> timeit(lambda: list2dictA(TEST))
5.336584700190931
>>> timeit(lambda: list2dictB(TEST))
2.7587691306531
>>> timeit(lambda: list2dictC(TEST))
2.1609074989233292
>>> timeit(lambda: list2dictD(TEST))
2.2543816669587216
The fastest function depends on the length of list_. adsmith's list2dictC() is pretty fast for small lists (fewer than ~80 items), but when the list grows larger I find my list2dictE() about 8% faster.
def list2dictC(list_): # dict comp over enumerate(list)
return_dict = {word: idx for idx, word in enumerate(list_)}
if len(return_dict) == len(list_):
return return_dict
else:
raise ValueError("Duplicate string found in list")
def list2dictE(list_): # Faster for lists with ~80 items or more
l = len(list_)
return_dict = dict(zip(list_, range(l)))
if len(return_dict) == l:
return return_dict
else:
raise ValueError("Duplicate string found in list")
If the length is known to be small, this check is of no use; but if that's not the case, maybe add something like l = len(list_); if l < 80: ... else: ..., as sketched below. It's just one additional if statement, since both functions need to know the list length anyway. The threshold of 80 items may vary depending on your setup, but it's in that ballpark for me on both Python 2.7 and 3.3.
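A sketch of that dispatch (the threshold is an assumption to be tuned, as noted above):

def list2dict_auto(list_, threshold=80):
    # threshold is a rough, machine-dependent guess; tune it with timeit
    l = len(list_)
    if l < threshold:
        return_dict = {word: idx for idx, word in enumerate(list_)}  # list2dictC
    else:
        return_dict = dict(zip(list_, range(l)))                      # list2dictE
    if len(return_dict) == l:
        return return_dict
    raise ValueError("Duplicate string found in list")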
def l2d(list_):
    dupes = set(filter(lambda x: list_.count(x) > 1, list_))
if len(dupes) > 0:
raise ValueError('duplicates: %s' % (dupes))
return dict((word, idx) for (idx, word) in enumerate(list_))
How does this compare?
Using pandas seems to speed it up by a third, though as said it's already possibly "fast enough"
> TEST = # english scrabble dictionary, 83882 entries
> def mylist2dict(list_):
return_dict = pd.Series(index=list_, data=np.arange(len(list_)))
if not return_dict.index.is_unique:
raise ValueError
return return_dict
> %timeit list2dict(TEST)
10 loops, best of 3: 28.8 ms per loop
> %timeit mylist2dict(TEST)
100 loops, best of 3: 18.8 ms per loop
This came in at an average of 2.36 µs per run. Because the OP's code looked like it was favoring the first occurrence of a value, I iterate over the range in reverse so that no logic is needed to check whether a value already exists (the first occurrence simply overwrites any later one).
def mkdct(lst):
    # walk the list backwards so the smallest index for each value wins
    return {lst[x]: x for x in range(len(lst) - 1, -1, -1)}
Edit: there were some dumb mistakes there.
I have a set containing ~300,000 tuples:
In [26]: sa = set(o.node for o in vrts_l2_5)
In [27]: len(sa)
Out[27]: 289798
In [31]: random.sample(sa, 1)
Out[31]: [('835644', '4696507')]
Now I want to lookup elements based on a common substring, e.g. the first 4 'digits' (in fact the elements are strings). This is my approach:
def lookup_set(x_appr, y_appr):
return [n for n in sa if n[0].startswith(x_appr) and n[1].startswith(y_appr)]
In [36]: lookup_set('6652','46529')
Out[36]: [('665274', '4652941'), ('665266', '4652956')]
Is there a more efficient, that is, faster way to do this?
You can do it in O(log(n) + m) time, where n is the number of tuples and m is the number of matching tuples, if you can afford to keep two sorted copies of the tuples.
Sorting itself will cost O(n log(n)), i.e. it will be asymptotically slower than your naive approach, but if you have to do a certain number of queries (more than log(n), which is almost certainly quite small) it will pay off.
The idea is that you can use bisection to find the candidates that have the correct first value and the correct second value and then intersect these sets.
However note that you want a strange kind of comparison: you care for all strings starting with the given argument. This simply means that when searching for the right-most occurrence you should fill the key with 9s.
Complete working (although not heavily tested) code:
from random import randint
from operator import itemgetter
first = itemgetter(0)
second = itemgetter(1)
sa = [(str(randint(0, 1000000)), str(randint(0, 1000000))) for _ in range(300000)]
f_sorted = sorted(sa, key=first)
s_sorted = sa
s_sorted.sort(key=second)
max_length = max(len(s) for _,s in sa)
# See: bisect module from stdlib
def bisect_right(seq, element, key):
lo = 0
hi = len(seq)
element = element.ljust(max_length, '9')
while lo < hi:
mid = (lo+hi)//2
if element < key(seq[mid]):
hi = mid
else:
lo = mid + 1
return lo
def bisect_left(seq, element, key):
lo = 0
hi = len(seq)
while lo < hi:
mid = (lo+hi)//2
if key(seq[mid]) < element:
lo = mid + 1
else:
hi = mid
return lo
def lookup_set(x_appr, y_appr):
x_left = bisect_left(f_sorted, x_appr, key=first)
x_right = bisect_right(f_sorted, x_appr, key=first)
    x_candidates = f_sorted[x_left:x_right]
y_left = bisect_left(s_sorted, y_appr, key=second)
y_right = bisect_right(s_sorted, y_appr, key=second)
    y_candidates = s_sorted[y_left:y_right]
return set(x_candidates).intersection(y_candidates)
And the comparison with your initial solution:
In [2]: def lookup_set2(x_appr, y_appr):
...: return [n for n in sa if n[0].startswith(x_appr) and n[1].startswith(y_appr)]
In [3]: lookup_set('123', '124')
Out[3]: set([])
In [4]: lookup_set2('123', '124')
Out[4]: []
In [5]: lookup_set('123', '125')
Out[5]: set([])
In [6]: lookup_set2('123', '125')
Out[6]: []
In [7]: lookup_set('12', '125')
Out[7]: set([('12478', '125908'), ('124625', '125184'), ('125494', '125940')])
In [8]: lookup_set2('12', '125')
Out[8]: [('124625', '125184'), ('12478', '125908'), ('125494', '125940')]
In [9]: %timeit lookup_set('12', '125')
1000 loops, best of 3: 589 us per loop
In [10]: %timeit lookup_set2('12', '125')
10 loops, best of 3: 145 ms per loop
In [11]: %timeit lookup_set('123', '125')
10000 loops, best of 3: 102 us per loop
In [12]: %timeit lookup_set2('123', '125')
10 loops, best of 3: 144 ms per loop
As you can see, this solution is about 240-1400 times faster (in these examples) than your naive approach.
If you have a big set of matches:
In [19]: %timeit lookup_set('1', '2')
10 loops, best of 3: 27.1 ms per loop
In [20]: %timeit lookup_set2('1', '2')
10 loops, best of 3: 152 ms per loop
In [21]: len(lookup_set('1', '2'))
Out[21]: 3587
In [23]: %timeit lookup_set('', '2')
10 loops, best of 3: 182 ms per loop
In [24]: %timeit lookup_set2('', '2')
1 loops, best of 3: 212 ms per loop
In [25]: len(lookup_set2('', '2'))
Out[25]: 33053
As you can see this solution is faster even if the number of matches is about 10% of the total size. However, if you try to match all the data:
In [26]: %timeit lookup_set('', '')
1 loops, best of 3: 360 ms per loop
In [27]: %timeit lookup_set2('', '')
1 loops, best of 3: 221 ms per loop
It becomes (not so much) slower, although this is a quite peculiar case, and I doubt you'll frequently match almost all the elements.
Note that the time taken to sort the data is quite small:
In [13]: from random import randint
...: from operator import itemgetter
...:
...: first = itemgetter(0)
...: second = itemgetter(1)
...:
...: sa2 = [(str(randint(0, 1000000)), str(randint(0, 1000000))) for _ in range(300000)]
In [14]: %%timeit
...: f_sorted = sorted(sa2, key=first)
...: s_sorted = sorted(sa2, key=second)
...: max_length = max(len(s) for _,s in sa2)
...:
1 loops, best of 3: 881 ms per loop
As you can see, it takes less than one second to make the two sorted copies. Actually the above code would be slightly faster, since it sorts the second copy "in place" (although timsort could still require O(n) memory).
This means that if you have to do more than about 6-8 queries this solution will be faster.
Note: Python's standard library provides a bisect module. However, it doesn't allow a key parameter (even though I remember reading that Guido wanted it, so it may be added in the future; it has since been added in Python 3.10). Hence, if you want to use it directly, you'll have to use the "decorate-sort-undecorate" idiom.
Instead of:
f_sorted = sorted(sa, key=first)
You should do:
f_sorted = sorted((first, (first,second)) for first,second in sa)
I.e. you explicitly insert the key as the first element of the tuple. Afterwards you could use ('123', '') as element to pass to the bisect_* functions and it should find the correct index.
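A minimal sketch of an equivalent route that avoids the decoration entirely: keep a parallel list of keys and call the stdlib bisect functions on it (the names are mine; first, sa and max_length are as defined earlier, and the '9'-padding plays the same role as above):

import bisect

f_sorted = sorted(sa, key=first)
f_keys = [first(t) for t in f_sorted]   # parallel key list for bisect

def prefix_slice(sorted_items, keys, prefix):
    # all items whose key starts with the given prefix
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_right(keys, prefix.ljust(max_length, '9'))
    return sorted_items[lo:hi]

# prefix_slice(f_sorted, f_keys, '123') -> tuples whose first element starts with '123'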
I decided to avoid this. I copy pasted the code from the sources of the module and slightly modified it to provide a simpler interface for your use-case.
Final remark: if you could convert the tuple elements to integers then the comparisons would be faster. However, most of the time would still be taken to perform the intersection of the sets, so I don't know exactly how much it will improve performances.
You could use a trie data structure. It is possible to build one with a tree of dict objects (see How to create a TRIE in Python), but there is a package, marisa-trie, that implements a memory-efficient version by binding to C++ libraries.
I have not used this library before, but playing around with it, I got this working:
from random import randint
from marisa_trie import RecordTrie
sa = [(str(randint(1000000,9999999)),str(randint(1000000,9999999))) for i in range(100000)]
# make length of string in packed format big enough!
fmt = ">10p10p"
sa_tries = (RecordTrie(fmt, zip((unicode(first) for first, _ in sa), sa)),
RecordTrie(fmt, zip((unicode(second) for _, second in sa), sa)))
def lookup_set(sa_tries, x_appr, y_appr):
"""lookup prefix in the appropriate trie and intersect the result"""
return (set(item[1] for item in sa_tries[0].items(unicode(x_appr))) &
set(item[1] for item in sa_tries[1].items(unicode(y_appr))))
lookup_set(sa_tries, "2", "4")
I went through and implemented the 4 suggested solutions to compare their efficiency. I ran the tests with different prefix lengths to see how the input would affect performance. The trie and sorted-list performance is definitely sensitive to the length of the input, with both getting faster as the prefix gets longer (I think it is actually sensitivity to the size of the output, since the output gets smaller as the prefix gets longer). However, the sorted-list (bisection) solution is definitely the fastest in all situations.
In these timing tests, there were 200000 tuples in sa and 10 runs for each method:
for prefix length 1
lookup_set_startswith : min=0.072107 avg=0.073878 max=0.077299
lookup_set_int : min=0.030447 avg=0.037739 max=0.045255
lookup_set_trie : min=0.111548 avg=0.124679 max=0.147859
lookup_set_sorted : min=0.012086 avg=0.013643 max=0.016096
for prefix length 2
lookup_set_startswith : min=0.066498 avg=0.069850 max=0.081271
lookup_set_int : min=0.027356 avg=0.034562 max=0.039137
lookup_set_trie : min=0.006949 avg=0.010091 max=0.032491
lookup_set_sorted : min=0.000915 avg=0.000944 max=0.001004
for prefix length 3
lookup_set_startswith : min=0.065708 avg=0.068467 max=0.079485
lookup_set_int : min=0.023907 avg=0.033344 max=0.043196
lookup_set_trie : min=0.000774 avg=0.000854 max=0.000929
lookup_set_sorted : min=0.000149 avg=0.000155 max=0.000163
for prefix length 4
lookup_set_startswith : min=0.065742 avg=0.068987 max=0.077351
lookup_set_int : min=0.026766 avg=0.034558 max=0.052269
lookup_set_trie : min=0.000147 avg=0.000167 max=0.000189
lookup_set_sorted : min=0.000065 avg=0.000068 max=0.000070
Here's the code:
import random
def random_digits(num_digits):
return random.randint(10**(num_digits-1), (10**num_digits)-1)
sa = [(str(random_digits(6)),str(random_digits(7))) for _ in range(200000)]
### naive approach
def lookup_set_startswith(x_appr, y_appr):
return [item for item in sa if item[0].startswith(x_appr) and item[1].startswith(y_appr) ]
### trie approach
from marisa_trie import RecordTrie
# make length of string in packed format big enough!
fmt = ">10p10p"
sa_tries = (RecordTrie(fmt, zip([unicode(first) for first, second in sa], sa)),
RecordTrie(fmt, zip([unicode(second) for first, second in sa], sa)))
def lookup_set_trie(x_appr, y_appr):
# lookup prefix in the appropriate trie and intersect the result
return set(item[1] for item in sa_tries[0].items(unicode(x_appr))) & \
set(item[1] for item in sa_tries[1].items(unicode(y_appr)))
### int approach
sa_ints = [(int(first), int(second)) for first, second in sa]
sa_lens = tuple(map(len, sa[0]))
def lookup_set_int(x_appr, y_appr):
x_limit = 10**(sa_lens[0]-len(x_appr))
y_limit = 10**(sa_lens[1]-len(y_appr))
x_int = int(x_appr) * x_limit
y_int = int(y_appr) * y_limit
return [sa[i] for i, int_item in enumerate(sa_ints) \
if (x_int <= int_item[0] and int_item[0] < x_int+x_limit) and \
(y_int <= int_item[1] and int_item[1] < y_int+y_limit) ]
### sorted set approach
from operator import itemgetter
first = itemgetter(0)
second = itemgetter(1)
sa_sorted = (sorted(sa, key=first), sorted(sa, key=second))
max_length = max(len(s) for _,s in sa)
# See: bisect module from stdlib
def bisect_right(seq, element, key):
lo = 0
hi = len(seq)
element = element.ljust(max_length, '9')
while lo < hi:
mid = (lo+hi)//2
if element < key(seq[mid]):
hi = mid
else:
lo = mid + 1
return lo
def bisect_left(seq, element, key):
lo = 0
hi = len(seq)
while lo < hi:
mid = (lo+hi)//2
if key(seq[mid]) < element:
lo = mid + 1
else:
hi = mid
return lo
def lookup_set_sorted(x_appr, y_appr):
x_left = bisect_left(sa_sorted[0], x_appr, key=first)
x_right = bisect_right(sa_sorted[0], x_appr, key=first)
x_candidates = sa_sorted[0][x_left:x_right]
y_left = bisect_left(sa_sorted[1], y_appr, key=second)
y_right = bisect_right(sa_sorted[1], y_appr, key=second)
y_candidates = sa_sorted[1][y_left:y_right]
return set(x_candidates).intersection(y_candidates)
####
# test correctness
ntests = 10
candidates = [lambda x, y: set(lookup_set_startswith(x,y)),
lambda x, y: set(lookup_set_int(x,y)),
lookup_set_trie,
lookup_set_sorted]
print "checking correctness (or at least consistency)..."
for dlen in range(1,5):
print "prefix length %d:" % dlen,
for i in range(ntests):
print " #%d" % i,
prefix = map(str, (random_digits(dlen), random_digits(dlen)))
answers = [c(*prefix) for c in candidates]
for i, ans in enumerate(answers):
for j, ans2 in enumerate(answers[i+1:]):
assert ans == ans2, "answers for %s for #%d and #%d don't match" \
% (prefix, i, j+i+1)
print
####
# time calls
import timeit
import numpy as np
ntests = 10
candidates = [lookup_set_startswith,
lookup_set_int,
lookup_set_trie,
lookup_set_sorted]
print "timing..."
for dlen in range(1,5):
print "for prefix length", dlen
times = [ [] for c in candidates ]
for _ in range(ntests):
prefix = map(str, (random_digits(dlen), random_digits(dlen)))
for c, c_times in zip(candidates, times):
tstart = timeit.default_timer()
trash = c(*prefix)
c_times.append(timeit.default_timer()-tstart)
for c, c_times in zip(candidates, times):
print " %-25s: min=%f avg=%f max=%f" % (c.func_name, min(c_times), np.mean(c_times), max(c_times))
Integer manipulation is much faster than string manipulation (and smaller in memory as well).
So if you can compare integers instead, you'll be much faster.
I suspect something like this should work for you:
sa = set(tuple(int(v) for v in o.node) for o in vrts_l2_5)  # convert both strings in each tuple to ints
Then this may work for you:
def lookup_set(samples, x_appr, x_len, y_appr, y_len):
"""
x_appr == SSS0000 where S is the digit to search for
x_len == number of digits to S (if SSS0000 then x_len == 4)
"""
    return ((x, y) for x, y in samples if round(x, -x_len) == x_appr and round(y, -y_len) == y_appr)
Also, it returns a generator, so you're not loading all the results into memory at once.
Updated to use round method mentioned by Bakuriu
There may be, but not by terribly much. str.startswith and the and operator both short-circuit (they can return as soon as they find a failure), and indexing tuples is a fast operation. Most of the time spent here will be on object lookups, such as finding the startswith method for each string. Probably the most worthwhile option is to run it through PyPy.
A faster solution would be to create a dictionary mapping each first value (as key) to the second values that occur with it.
Then you would search for keys matching x_appr in the sorted key list of that dictionary (the sorted list lets you optimize the key search, with a binary search for example). This provides a key list, named for example k_list.
Then you look up the values of the dictionary whose key is in k_list and which match y_appr.
You can also apply the second step (checking that the value matches y_appr) before appending to k_list, so that k_list contains only the keys of the correct elements of the dictionary. A sketch follows.
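A rough sketch of that idea (the function name and the sentinel used to bound the prefix range are my own; sa is the set of tuples from the question):

from bisect import bisect_left, bisect_right
from collections import defaultdict

# Build once: first value -> list of second values seen with it.
pairs = defaultdict(list)
for x, y in sa:
    pairs[x].append(y)
sorted_keys = sorted(pairs)

def lookup_set_dict(x_appr, y_appr):
    # binary-search the sorted key list for everything starting with x_appr
    lo = bisect_left(sorted_keys, x_appr)
    hi = bisect_right(sorted_keys, x_appr + '\U0010ffff')
    return [(x, y) for x in sorted_keys[lo:hi]
                   for y in pairs[x] if y.startswith(y_appr)]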
Here I've just compared the 'in' operator and the 'find' method.
The CSV input file contains a list of URLs.
# -*- coding: utf-8 -*-
### test perfo str in set
import re
import sys
import time
import json
import csv
import timeit
cache = set()
#######################################################################
def checkinCache(c):
global cache
for s in cache:
if c in s:
return True
return False
#######################################################################
def checkfindCache(c):
global cache
for s in cache:
if s.find(c) != -1:
return True
return False
#######################################################################
print "1/3-loading pages..."
with open("liste_all_meta.csv.clean", "rb") as f:
reader = csv.reader(f, delimiter=",")
for i,line in enumerate(reader):
cache.add(re.sub("'","",line[2].strip()))
print " "+str(len(cache))+" PAGES IN CACHE"
print "2/3-test IN..."
tstart = timeit.default_timer()
for i in range(0, 1000):
checkinCache("string to find"+str(i))
print timeit.default_timer()-tstart
print "3/3-test FIND..."
tstart = timeit.default_timer()
for i in range(0, 1000):
checkfindCache("string to find"+str(i))
print timeit.default_timer()-tstart
print "\n\nBYE\n"
results in seconds:
1/3-loading pages...
482897 PAGES IN CACHE
2/3-test IN...
107.765980005
3/3-test FIND...
167.788629055
BYE
So, the 'in' operator is faster than the 'find' method :)
Have fun