Useful information:
For information on how to sort a list of various data types see:
How to sort (list/tuple) of lists/tuples?
.. and for information on how to perform a binary search on a sorted list see: Binary search (bisection) in Python
My question:
How can you neatly apply binary search (or another O(log n) search algorithm) to a list of some data type, where the key is an inner component of the data type itself? To keep the question simple we can use a list of tuples as an example:
x = [("a", 1), ("b",2), ("c",3)]
binary_search(x, "b") # search for "b", should return 1
# note how we are NOT searching for ("b",2) yet we want ("b",2) returned anyways
To simplify even further: we only need to return a single search result, not multiple if for example ("b",2) and ("b",3) both existed.
Better yet:
How can we modify the following simple code to perform the above operation?
from bisect import bisect_left
def binary_search(a, x, lo=0, hi=None):  # can't use a to specify default for hi
    hi = hi if hi is not None else len(a)  # hi defaults to len(a)
    pos = bisect_left(a, x, lo, hi)  # find insertion position
    return (pos if pos != hi and a[pos] == x else -1)  # don't walk off the end
PLEASE NOTE: I am not looking for the complete algorithm itself. Rather, I am looking for the application of some of Python's standard(ish) libraries, and/or Python's other functionalities so that I can easily search a sorted list of some arbitrary data type at any time.
Thanks
Take advantage of how lexicographic ordering deals with tuples of unequal length:
import bisect
# bisect_right would also work
index = bisect.bisect_left(x, ('b',))
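To see how this could slot into the binary_search function from the question, here is a minimal sketch. The helper name search_by_first is made up for illustration, and it assumes the list is sorted and the key is the first element of each tuple.

```python
from bisect import bisect_left

def search_by_first(a, key):
    """Hypothetical helper: index of the first tuple in sorted list `a`
    whose first element equals `key`, or -1 if there is none."""
    # (key,) compares less than any (key, anything), so bisect_left lands
    # on the first tuple starting with `key`, if one exists.
    pos = bisect_left(a, (key,))
    if pos != len(a) and a[pos][0] == key:
        return pos
    return -1

x = [("a", 1), ("b", 2), ("c", 3)]
print(search_by_first(x, "b"))   # 1
print(search_by_first(x, "bb"))  # -1
```

The final check on a[pos][0] is what turns an insertion point into a real "found / not found" answer.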
It may sometimes be convenient to feed a custom sequence type to bisect:
class KeyList(object):
    # bisect doesn't accept a key function, so we build the key into our sequence.
    def __init__(self, l, key):
        self.l = l
        self.key = key
    def __len__(self):
        return len(self.l)
    def __getitem__(self, index):
        return self.key(self.l[index])
import operator
# bisect_right would *not* work for this one.
index = bisect.bisect_left(KeyList(x, operator.itemgetter(0)), 'b')
What about converting the list of tuples to a dict?
>>> d = dict([("a", 1), ("b",2), ("c",3)])
>>> d['b']
2
Input
ONESTRING
STRINGTHREE
THREEFOUR
FOURFIVE
Output
ONESTRINGTHREEFOURFIVE
How can I do this in Python?
My idea was to first concatenate two of the strings and then run a loop over the rest, but this gives an error and I don't know why. Can anyone help?
WARNING
This solution is for a list of strings in arbitrary order. This means that EVERY possible pair of words must be checked for a common substring, which may require an enormous amount of memory if your list of strings is large.
Solution 1, allows for words with no common substrings to be concatenated if desired
import itertools
from typing import Set, Tuple, Dict, List
def get_match(pair: Tuple[str, str], min_overlap: int = 3) -> str:
    a, b = pair
    for i in range(min_overlap, min(map(len, pair)) + 1):
        if a[-i:] == b[:i]:
            return b[:i]
    return ""
def links_joiners(strings: List[str]) -> Tuple[Dict[str, str], Set[str]]:
    links, joiners = dict(), set()
    for pair in itertools.permutations(strings, 2):
        if (match := get_match(pair)):
            joiners.add(match)
            links.update((pair,))
    return links, joiners
def get_ordered_strings(strings: List[str], links: Dict[str, str]) -> List[str]:
    def find_order(node: str) -> int:
        return 0 if node not in links else 1 + find_order(links[node])
    return sorted(strings, key=find_order, reverse=True)
def join_strings(strings: List[str], joiners: Set[str]) -> str:
    s = "".join(strings)
    for j in joiners:
        s = s.replace(j, "", 1)
    return s
Usage:
strings = ["THREEFOUR",
           "ONESTRING",
           "STRINGTHREE",
           "FOURFIVE"]
links, joiners = links_joiners(strings)
ordered_strings = get_ordered_strings(strings, links)
join_strings(ordered_strings, joiners)
Output:
'ONESTRINGTHREEFOURFIVE'
Explanation
First, itertools is part of the standard library; no need to install any third party packages for this solution.
Now, the links_joiners() function will take a list of strings and find all the pairs of strings with matching suffix-prefix pairs, putting those pairs into a links dictionary which looks like this:
{'ONESTRING': 'STRINGTHREE',
'THREEFOUR': 'FOURFIVE',
'STRINGTHREE': 'THREEFOUR'}
Notice these are not in order. This is because for an arbitrary list of strings we can't be sure the strings were in order in the first place, so we have to iterate over every permutation of strings exhaustively in order to ensure that we've covered all pairings.
Now, notice there's also a function called get_ordered_strings() with an inner function find_order(). The inner function find_order() is what is known as a closure (it captures links from the enclosing function), but that's not particularly important to understand right now. The find_order() function is recursive. Here's how it works:
1. Given a node, if the node is not a key in the links dictionary, we've reached the base case and return zero. Otherwise, move to step 2.
2. If the node is present, return one plus the result of a recursive call to find_order() on the node it links to.
So given a key, say "ONESTRING", the find_order() function will look at the value associated with that key, and if that value is also a key in the dictionary, look at its value, and so on until it reaches a value that isn't a key in the dictionary.
Here's the code for find_order() again:
def find_order(node: str) -> int:
    if node not in links:
        return 0
    return 1 + find_order(links[node])
And here's what links looks like after calling links_joiners():
{'ONESTRING': 'STRINGTHREE',
'THREEFOUR': 'FOURFIVE',
'STRINGTHREE': 'THREEFOUR'}
Now trace an example call to find_order("ONESTRING"):
find_order("ONESTRING") = 1 + find_order("STRINGTHREE")
                        = 1 + (1 + find_order("THREEFOUR"))
                        = 1 + (1 + (1 + find_order("FOURFIVE")))  # Base case
                        = 1 + (1 + (1 + 0))
                        = 3
What this function is doing is finding how many pairwise connections can be made from a given starting string. Another way to think of it is that links is actually representing adjacencies in a (special case of a) DAG.
Essentially what we want to do is take the nodes THREEFOUR, ONESTRING, STRINGTHREE, FOURFIVE and construct the longest possible singly-linked list (a type of a DAG) from them:
ONESTRING -> STRINGTHREE -> THREEFOUR -> FOURFIVE
By passing a given "node" of this graph to find_order(), it will follow the graph all the way to the end. So ONESTRING travels a distance of 3 to get to the end, whereas THREEFOUR travels only a distance of 1.
Node: ONESTRING -> STRINGTHREE -> THREEFOUR -> FOURFIVE
Dist:     3             2              1           0
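The distances in the diagram can be reproduced with a short sketch. It rebuilds the links dictionary by hand (copied from the explanation above) rather than calling links_joiners(), so it runs on its own.

```python
# links copied from the explanation above
links = {"ONESTRING": "STRINGTHREE",
         "THREEFOUR": "FOURFIVE",
         "STRINGTHREE": "THREEFOUR"}

def find_order(node):
    # Number of links that can be followed from `node` before the chain ends.
    return 0 if node not in links else 1 + find_order(links[node])

strings = ["THREEFOUR", "ONESTRING", "STRINGTHREE", "FOURFIVE"]
distances = {s: find_order(s) for s in strings}
print(distances)
# {'THREEFOUR': 1, 'ONESTRING': 3, 'STRINGTHREE': 2, 'FOURFIVE': 0}
```

One caveat worth keeping in mind: if links ever contained a cycle (two strings that each overlap the other), this recursion would never reach the base case.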
Now, by passing find_order to the built-in sorted() function, we can tell Python how we want our strings to be sorted, which, in this case is in reverse order, by distance. The result is this:
>>> strings = ['THREEFOUR', 'ONESTRING', 'STRINGTHREE', 'FOURFIVE']
>>> ordered_strings = get_ordered_strings(strings, links)
>>> ordered_strings
['ONESTRING', 'STRINGTHREE', 'THREEFOUR', 'FOURFIVE']
Now, by joining each string by their common substrings, we are constructing the longest possible string where the constraint is that each pair of strings must have a common substring in the correct position. In other words, ordered_strings represents the longest path in the DAG. Or more accurately, we've designed a DAG which will have the longest path, by using all the provided nodes, and putting them in the correct order.
From here, we join each string:
>>> s = "".join(ordered_strings)
>>> s
'ONESTRINGSTRINGTHREETHREEFOURFOURFIVE'
Then we remove one instance of each of the joiners:
for j in joiners:
    s = s.replace(j, "", 1)
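To make the removal step concrete, here is a small sketch that prints the string after each replacement. The joiners are put in a list here (an assumption for illustration only) so that the order of the printed steps is fixed; the original code uses a set.

```python
s = "ONESTRINGSTRINGTHREETHREEFOURFOURFIVE"
joiners = ["STRING", "THREE", "FOUR"]  # a list only so the order is deterministic

for j in joiners:
    # Remove only the first occurrence of each shared substring.
    s = s.replace(j, "", 1)
    print(s)
# ONESTRINGTHREETHREEFOURFOURFIVE
# ONESTRINGTHREEFOURFOURFIVE
# ONESTRINGTHREEFOURFIVE
```

For this input any order of joiners gives the same result. In general a joiner could also occur earlier in the string than the seam it belongs to, in which case replace() would remove the wrong copy, so the result is worth checking on other inputs.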
Solution 2, only concatenates overlapping strings
This solution reuses join_strings() and get_match() from above. It also uses the walrus operator := (Python 3.8+) but can easily be written without it.
def join_overlapping_pairs(strings: List[str]) -> str:
    if len(strings) == 1:
        return strings.pop()
    matches = set()
    for pair in itertools.permutations(strings, 2):
        if (match := get_match(pair)):
            matches.add(join_strings(pair, (match,)))
    return join_overlapping_pairs(matches)
Here is a generic solution based on the example you provided. The sequence must already be in connecting order, otherwise it will not work.
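from functools import reduce
s = [
    "ONESTRING",
    "STRINGTHREE",
    "THREEFOUR",
    "FOURFIVE",
]
def join_f(first, add):
    # Try the longest possible overlap first: the end of `first`
    # must match the start of `add`.
    for i in range(min(len(first), len(add)), 0, -1):
        if first.endswith(add[:i]):
            return first + add[i:]
    return first + add  # no overlap at all
print(reduce(join_f, s))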
You could also use the difflib library to find the overlap. Sample code for reference:
from difflib import SequenceMatcher
str1 = "ONESTRING"
str2 = "STRINGTHREE"
match = SequenceMatcher(None, str1, str2).find_longest_match(0, len(str1), 0, len(str2))
# match.a == 3, match.b == 0, match.size == 6
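The match found this way still has to be turned into a joined string. A minimal sketch of one way to do that follows; the helper name join_with_difflib is hypothetical, and it only merges when the longest common block happens to be a suffix of the first string and a prefix of the second.

```python
from difflib import SequenceMatcher

def join_with_difflib(str1, str2):
    """Hypothetical helper: merge str2 onto str1 when their longest common
    block ends str1 and starts str2; otherwise just concatenate."""
    m = SequenceMatcher(None, str1, str2, autojunk=False).find_longest_match(
        0, len(str1), 0, len(str2))
    if m.size and m.a + m.size == len(str1) and m.b == 0:
        return str1 + str2[m.size:]
    return str1 + str2

print(join_with_difflib("ONESTRING", "STRINGTHREE"))  # ONESTRINGTHREE
```

Note that the longest common block is not necessarily the suffix/prefix overlap (a longer common substring may sit in the middle of both strings), so this sketch falls back to plain concatenation in that case.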
Assuming the words are in the connecting order:
words = ['ONESTRING',
         'STRINGTHREE',
         'THREEFOUR',
         'FOURFIVE']
S = words[0]
for w in words[1:]:
    # n is the length of the longest prefix of w that S already ends with
    S += w[next(n for n in range(len(w)-1, 0, -1) if S.endswith(w[:n])):]
print(S)
'ONESTRINGTHREEFOURFIVE'
If the words are not in connecting order, a recursive approach can do it:
def combine(words, S=""):
    if not words: return S
    result = ""                                   # result is shortest combo
    for i, w in enumerate(words):                 # p is max overlap (end/start)
        p = next((n for n in range(len(w)-1, 0, -1) if S.endswith(w[:n])), 0)
        if result and not p: continue             # check if can combine
        combo = combine(words[:i]+words[i+1:], S+w[p:])  # candidate combo
        if not result or len(combo) < len(result):       # keep if shortest
            result = combo or result
    return result
Output:
words = ['ONESTRING',
         'FOURFIVE',
         'THREEFOUR',
         'STRINGTHREE'
        ]
result = combine(words)
print(result)
'ONESTRINGTHREEFOURFIVE'
I have a list stored in this format: [(int, (int, int)), ...]
My code looks like this for the first case.
heap.heappush(list_, (a, b)) # b is a tuple
while len(list_):
    temp = heap.heappop(list_)[1]
Now my ideal implementation would be
list_.append(a, b) # b is a tuple
while len(list_):
    list_.sort(key = lambda x: x[0])
    temp = list_.pop(0)[1]
The second implementation causes issues in other parts of my code. Is there any reason the second is incorrect, and how could I correct it to work like the heapq version?
EDIT: I know heappop() pops the smallest value out, which is why I have sorted the list based off of the 'a' (which heappop uses too, I assume)
To work with heapq you have to be aware that Python implements min-heaps. That means the element at index 0 is always the smallest.
Here is what you want implemented with heapq:
import heapq
from typing import Tuple
class MyObject:
    def __init__(self, a: int, b: Tuple[int, int]):
        self.a = a
        self.b = b
    def __lt__(self, other):
        return self.a < other.a
l = [MyObject(..,..), MyObject(..,..),..,MyObject(..,..)] # your list
heapq.heapify(l) # convert l to a heap
heapq.heappop(l) # retrieves and removes the smallest element (which lies at index 0)
# After popping the smallest element, the heap internally rearranges its elements to bring the new smallest value to index 0. I.e. it maintains the heap invariant automatically and you don't need to sort explicitly!
Notice:
I didn't need to sort. The heap is by nature a semi-sorted structure.
I needed to create a dedicated class for my objects. This is cleaner than working with tuples of tuples, and it also allowed me to define __lt__ (less than) so that the heap knows how to order the tree internally.
Here is more detailed info for the avid reader :D
When you work with heaps you shouldn't need to sort explicitly. A heap is by nature a semi-sorted structure that (in the case of Python's heapq) keeps the smallest element at index 0. You don't, however, see the tree structure underneath (remember, a min-heap is a tree in which each parent node is no larger than any of its children and descendants).
Moreover, you don't need to push the elements one by one. Rather, use heapq.heapify() on the whole list. It builds the heap in O(n) time and avoids the redundant sift-up/sift-down operations of n separate pushes, which cost O(n log n) in total and add up if your list is huge :)
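For comparison, plain tuples in the question's (int, (int, int)) format also work with heapq without a wrapper class. The following is a minimal sketch with made-up sample data, not the asker's actual list.

```python
import heapq

# Hypothetical data in the question's format: (int, (int, int))
items = [(5, (0, 1)), (1, (2, 3)), (3, (4, 5))]

heapq.heapify(items)                 # O(n), rearranges the list in place
heapq.heappush(items, (2, (6, 7)))   # later additions keep the invariant

popped = []
while items:
    a, b = heapq.heappop(items)      # always the smallest remaining item
    popped.append(b)

print(popped)  # [(2, 3), (6, 7), (4, 5), (0, 1)]
```

One difference from list_.sort(key=lambda x: x[0]): heapq compares whole tuples, so when two items share the same first element the tie is broken by comparing the inner tuples, whereas the stable sort keeps tied items in insertion order. That may explain differing results between the two implementations.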
I read that question about how to use bisect on a list of tuples, and I used that information to answer that question. It works, but I'd like a more generic solution.
Since bisect doesn't let you specify a key function, if I have this:
import bisect
test_array = [(1,2),(3,4),(5,6),(5,7000),(7,8),(9,10)]
and I want to find the first item where x > 5 for those (x,y) tuples (not considering y at all), I'm currently doing this:
bisect.bisect_left(test_array,(5,10000))
and I get the correct result because I know that no y is greater than 10000, so bisect points me to the index of (7,8). Had I put 1000 instead, it would have been wrong.
For integers, I could do
bisect.bisect_left(test_array,(5+1,))
but in the general case, when there may be floats, how do I do that without knowing the max value of the 2nd element?
test_array = [(1,2),(3,4),(5.2,6),(5.2,7000),(5.3,8),(9,10)]
I have tried this:
bisect.bisect_left(test_array,(min_value+sys.float_info.epsilon,))
and it didn't work, but I have tried this:
bisect.bisect_left(test_array,(min_value+sys.float_info.epsilon*3,))
and it worked. But it feels like a bad hack. Any clean solutions?
As of Python 3.10, bisect finally supports key! So if you're on 3.10 or up, just use key. But if you're not...
bisect supports arbitrary sequences. If you need to use bisect with a key, instead of passing the key to bisect, you can build it into the sequence:
class KeyList(object):
    # bisect doesn't accept a key function before 3.10,
    # so we build the key into our sequence.
    def __init__(self, l, key):
        self.l = l
        self.key = key
    def __len__(self):
        return len(self.l)
    def __getitem__(self, index):
        return self.key(self.l[index])
Then you can use bisect with a KeyList, with O(log n) performance and no need to copy the bisect source or write your own binary search:
bisect.bisect_right(KeyList(test_array, key=lambda x: x[0]), 5)
This is a (quick'n'dirty) bisect_left implementation that allows an arbitrary key function:
def bisect(lst, value, key=None):
    if key is None:
        key = lambda x: x
    def bis(lo, hi=len(lst)):
        while lo < hi:
            mid = (lo + hi) // 2
            if key(lst[mid]) < value:
                lo = mid + 1
            else:
                hi = mid
        return lo
    return bis(0)
>>> from operator import itemgetter
>>> test_array = [(1, 2), (3, 4), (4, 3), (5.2, 6), (5.2, 7000), (5.3, 8), (9, 10)]
>>> print(bisect(test_array, 5, key=itemgetter(0)))
3
This keeps the O(log n) performance, since it does not assemble a new list of keys. Implementations of binary search are widely available, but this one was taken straight from the bisect_left source.
It should also be noted that the list needs to be sorted with regard to the same key function.
For this:
...want to find the first item where x > 5 for those (x,y) tuples (not considering y at all)
Something like:
import bisect
test_array = [(1,2),(3,4),(5,6),(5,7000),(7,8),(9,10)]
first_elem = [elem[0] for elem in test_array]
print(bisect.bisect_right(first_elem, 5))
The bisect_right function will take the first index past, and since you're just concerned with the first element of the tuple, this part seems straight forward. ...still not generalising to a specific key function I realize.
As @Jean-FrançoisFabre pointed out, building first_elem already processes the entire array, so using bisect may not even be very helpful.
Not sure if it's any quicker, but we could alternatively use something like filter() (itertools.ifilter in Python 2; yes, this is a bit ugly):
test_array = [(1,2),(3,4),(5,6),(5,7000),(7,8),(9,10)]
print(next(filter(
    lambda tp: tp[1][0] > 5,
    enumerate(test_array)))[0]
)
As an addition to the nice suggestions, I'd like to add my own answer, which works with floats (as I just figured it out):
bisect.bisect_left(test_array,(min_value+abs(min_value)*sys.float_info.epsilon,))
would work (whether min_value is positive or negative, but not when it is exactly 0, since the increment is then 0 too). epsilon multiplied by abs(min_value) is large enough to be meaningful when added to min_value (it is not absorbed/cancelled). So the result is a float just above min_value, and bisect will work with that.
If you have only integers that will still be faster & clearer:
bisect.bisect_left(test_array,(min_value+1,))
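On Python 3.9 or later, math.nextafter() gives the next representable float directly, which avoids the epsilon arithmetic and also handles min_value == 0. A minimal sketch with the float test_array from the question:

```python
import bisect
import math

test_array = [(1, 2), (3, 4), (5.2, 6), (5.2, 7000), (5.3, 8), (9, 10)]
min_value = 5.2

# Smallest float strictly greater than min_value (Python 3.9+).
just_above = math.nextafter(min_value, math.inf)

index = bisect.bisect_left(test_array, (just_above,))
print(index, test_array[index])  # 4 (5.3, 8)
```

This still relies on the one-element tuple sorting before any longer tuple with the same first element, exactly as in the integer version.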
I am trying to implement a user defined sort function, similar to the python List sort as in list.sort(cmp = None, key = None, reverse = False) for example.
Here is my code so far
from operator import itemgetter
class Sort:
    def __init__(self, sList, key = itemgetter(0), reverse = False):
        self._sList = sList
        self._key = key
        self._reverse = reverse
        self.sort()
    def sort(self):
        for index1 in range(len(self._sList) - 1):
            for index2 in range(index1, len(self._sList)):
                if self._reverse == True:
                    if self._sList[index1] < self._sList[index2]:
                        self._sList[index1], self._sList[index2] = self._sList[index2], self._sList[index1]
                else:
                    if self._sList[index1] > self._sList[index2]:
                        self._sList[index1], self._sList[index2] = self._sList[index2], self._sList[index1]
List = [[1 ,2],[3, 5],[5, 1]]
Sort(List, reverse = True)
print(List)
I have a really bad time when it comes to the key parameter.
More specifically, I would like to know if there is a way to code a list with optional indexes (similar to foo(*parameters) ).
I really hope you understand my question.
key is a function to convert the item to a criterion used for comparison.
Called with the item as the sole parameter, it returns a comparable value of your choice.
A classic key example, for integers stored as strings, is:
lambda x : int(x)
so strings are sorted numerically.
In your algorithm, you would have to replace
self._sList[index1] < self._sList[index2]
by
self._key(self._sList[index1]) < self._key(self._sList[index2])
so the values computed from items are compared, rather than the items themselves.
Note that Python 3 dropped the cmp parameter and kept only the key parameter.
Also note that in your case, using itemgetter(0) as the key function works for subscriptable items such as list (sorting by first item only) or str (sorting by first character only).
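Putting the replacement described above together, a key-aware version of the question's loop might look like the sketch below. The function name simple_sort is made up for illustration, and it is written as a plain function rather than the question's Sort class.

```python
def simple_sort(items, key=None, reverse=False):
    """Hypothetical sketch: sort `items` in place with a simple exchange
    sort, comparing key(item) values instead of the items themselves."""
    if key is None:
        key = lambda item: item          # default: compare items directly
    n = len(items)
    for i in range(n - 1):
        for j in range(i + 1, n):
            ki, kj = key(items[i]), key(items[j])
            out_of_order = ki < kj if reverse else ki > kj
            if out_of_order:
                items[i], items[j] = items[j], items[i]

data = [[1, 2], [3, 5], [5, 1]]
simple_sort(data, key=lambda pair: pair[1], reverse=True)
print(data)  # [[3, 5], [1, 2], [5, 1]]
```

This is O(n²), calls key repeatedly, and is not stable, so it is only meant to show where key fits in; the built-in sorted() and list.sort() remain the practical choice.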
I'm trying to make sorting of objects in my program as fast as possible. I know that sorting with a cmp function is deprecated in python 3.x and that using keys is faster, but I'm not sure how to get the same functionality without using a cmp function.
Here's the definition of the class:
class Topic:
    def __init__(self, x, y, val):
        self.id = val
        self.x = x
        self.y = y
I have a dictionary full of Topic to float key, value pairs and a list of Topics to be sorted. Each topic in the list of Topics to be sorted has an entry in this dictionary. I need to sort the list of topics by the value in the dictionary. If two topics have a difference in value <= .001, the topic with higher ID should come first.
Here's the current code I'm using:
topicToDistance = {}
# ...
# topicToDistance populated
# ...
topics = topicToDistance.keys()
def firstGreaterCmp(a, b):
    if abs(topicToDistance[a]-topicToDistance[b]) <= .001:
        if a.id < b.id:
            return 1
    if topicToDistance[a] > topicToDistance[b]:
        return 1
    return -1
# sorting by key may be faster than using a cmp function
topics.sort(cmp = firstGreaterCmp)
Any help making this as fast as possible would be appreciated.