Complex Sorting a List of Objects Efficiently - Python

I'm trying to make the sorting of objects in my program as fast as possible. I know that sorting with a cmp function is deprecated in Python 3.x and that using keys is faster, but I'm not sure how to get the same functionality without using a cmp function.
Here's the definition of the class:
class Topic:
    def __init__(self, x, y, val):
        self.id = val
        self.x = x
        self.y = y
I have a dictionary mapping Topic keys to float values, and a list of Topics to be sorted. Each topic in the list has an entry in this dictionary. I need to sort the list of topics by their values in the dictionary. If two topics differ in value by <= .001, the topic with the higher ID should come first.
Here's the current code I'm using:
topicToDistance = {}
# ...
# topicToDistance populated
# ...
topics = topicToDistance.keys()

def firstGreaterCmp(a, b):
    if abs(topicToDistance[a] - topicToDistance[b]) <= .001:
        if a.id < b.id:
            return 1
    if topicToDistance[a] > topicToDistance[b]:
        return 1
    return -1

# sorting by key may be faster than using a cmp function
topics.sort(cmp=firstGreaterCmp)
Any help making this as fast as possible would be appreciated.
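For reference, a hedged sketch of the usual Python 3 routes: either wrap the existing cmp with functools.cmp_to_key, or approximate the tolerance rule with a plain key by bucketing distances. Note the 0.001 tolerance is not transitive, so any key is only an approximation of the cmp above:

import functools

# Option 1: keep the cmp logic as-is (Python 2.7 / 3.2+)
topics = sorted(topicToDistance.keys(), key=functools.cmp_to_key(firstGreaterCmp))

# Option 2 (approximation): quantize distances into 0.001-wide buckets,
# then break ties with -id so the higher id comes first
topics = sorted(topicToDistance.keys(),
                key=lambda t: (round(topicToDistance[t] / 0.001), -t.id))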


Python: Search a sorted list of tuples

Useful information:
For information on how to sort a list of various data types see:
How to sort (list/tuple) of lists/tuples?
... and for information on how to perform a binary search on a sorted list, see: Binary search (bisection) in Python
My question:
How can you neatly apply binary search (or another log(n) search algorithm) to a list of some data type, where the key is an inner component of the data type itself? To keep the question simple, we can use a list of tuples as an example:
x = [("a", 1), ("b",2), ("c",3)]
binary_search(x, "b") # search for "b", should return 1
# note how we are NOT searching for ("b",2) yet we want ("b",2) returned anyways
To simplify even further: we only need to return a single search result, not multiple if for example ("b",2) and ("b",3) both existed.
Better yet:
How can we modify the following simple code to perform the above operation?
from bisect import bisect_left

def binary_search(a, x, lo=0, hi=None):    # can't use a to specify default for hi
    hi = hi if hi is not None else len(a)  # hi defaults to len(a)
    pos = bisect_left(a, x, lo, hi)        # find insertion position
    return (pos if pos != hi and a[pos] == x else -1)  # don't walk off the end
PLEASE NOTE: I am not looking for the complete algorithm itself. Rather, I am looking for the application of some of Python's standard(ish) libraries, and/or Python's other functionalities so that I can easily search a sorted list of some arbitrary data type at any time.
Thanks
Take advantage of how lexicographic ordering deals with tuples of unequal length:
import bisect

# bisect_right would also work
index = bisect.bisect_left(x, ('b',))
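For the example list from the question, a quick check of what this returns:

x = [("a", 1), ("b", 2), ("c", 3)]
print(bisect.bisect_left(x, ('b',)))  # 1, because ('b',) sorts just before ('b', 2)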
It may sometimes be convenient to feed a custom sequence type to bisect:
class KeyList(object):
    # bisect doesn't accept a key function, so we build the key into our sequence.
    def __init__(self, l, key):
        self.l = l
        self.key = key
    def __len__(self):
        return len(self.l)
    def __getitem__(self, index):
        return self.key(self.l[index])
import operator
# bisect_right would *not* work for this one.
index = bisect.bisect_left(KeyList(x, operator.itemgetter(0)), 'b')
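Worth noting (version-dependent): since Python 3.10, bisect accepts a key= argument directly, so the wrapper above is unnecessary there:

# Python 3.10+ only: the needle 'b' is compared against key(item)
index = bisect.bisect_left(x, 'b', key=operator.itemgetter(0))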
What about converting the list of tuples to a dict?
>>> d = dict([("a", 1), ("b",2), ("c",3)])
>>> d['b'] # 2

How to iterate over nested data when there is no reliable order but all elements of the lowest level need to be accessed and checked?

I came across this question in a very specific context, but I soon realized that it has quite general relevance.
FYI: I'm getting data from a framework, and at one point I have transformed it into a list of unordered pairs (it could be a list of lists or tuples of any size as well, but at the moment I have 100% pairs). In my case these pairs represent relationships between data objects, and I want to refine my data.
I have a list of unordered tuples and want a list of objects, or in this case a dict of dicts. If the same letter indicates the same class and differing numbers indicate different instances, I want to accomplish this transformation:
[(a1, x1), (x2, a2), (y1, a2), (y1, a1)] -> {a1:{"y":y1,"x":x1},a2:{"y":y1,"x":x2}}
Note that there can be many "a"s connected to the same "x" or "y", but every "a" has at most one "x" or "y" each, and that I can rely on neither the order of the tuples nor the order of each tuple's elements (because the framework does not distinguish between "a" and "x"). I obviously don't care about the order of elements in my dicts - I just need the proper relations. There are many other pairs I don't care about, and they can contain "a", "x" or "y" elements as well.
So the main question is: "How to iterate over nested data when there is no reliable order but all elements of the lowest level need to be accessed and checked?"
I tried it in several ways, but none of them seem right. For simplicity I just check for A-X pairs here:
def first_draft(list_of_pairs):
    result = {}
    for pair in list_of_pairs:
        if pair[0].__class__ is A and pair[1].__class__ is X:
            result[pair[0]] = {"X": pair[1]}
        if pair[0].__class__ is X and pair[1].__class__ is A:
            result[pair[1]] = {"X": pair[0]}
    return result
def second_draft(list_of_pairs):
    result = {}
    for pair in list_of_pairs:
        for index, item in enumerate(pair):
            if item.__class__ is A:
                other_index = (index + 1) % 2
                if pair[other_index].__class__ is X:
                    result[item] = {"X": pair[other_index]}
    return result
def third_draft(list_of_pairs):
    result = {}
    for pair in list_of_pairs:
        for item in pair:
            if item.__class__ is A:
                for any_item in pair:
                    if any_item.__class__ is X:
                        result[item] = {"X": any_item}
    return result
The third draft actually works for every size of sublist and gets rid of any unpythonic integer indexing, but iterating over the same list while already iterating over it? And quintuple nesting for just one productive line of code? That does not seem right to me, and I learned "When there is an iteration problem in Python and you don't know a good solution, there is a great solution in itertools!" - I just didn't find one.
Does someone know a builtin that can help me, or simply a better way to implement my methods?
You can do something like this with strings:
l = [('a1', 'x1', 'z3'), ('x2', 'a2'), ('y1', 'a2'), ('y1', 'a1')]
res = {}
for tup in l:
    main_class = ""
    sub_classes = ""
    for item in tup:
        if item.startswith('a'):
            main_class = item
            sub_classes = list(tup)
            sub_classes.remove(main_class)
    if main_class not in res:
        res[main_class] = {}
    for item in sub_classes:
        res[main_class][item[0]] = item[-1]
If your objects aren't strings, you just need to change if item.startswith('a'): to something that determines whether that item should be the key or not.
This also handles tuples longer than two items. It iterates over each tuple, finding the "main class", and then removes it from a list version of the tuple (so that the new list contains only the sub classes).
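For the record, running the sketch above on l gives the following (dict ordering may vary; note the stored values are the trailing characters, item[-1], so store item itself if you want the whole object back):

print(res)
# {'a1': {'x': '1', 'z': '3', 'y': '1'}, 'a2': {'x': '2', 'y': '1'}}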
Looks like Ned Batchelder (who said that every time you have a problem with iterables and don't think there is a nice solution in Python, there is a solution in itertools) was right. I finally found the solution I had overlooked last time: the permutations function.
from itertools import permutations

def final_draft(list_of_pairs):
    result = {}
    for pair in list_of_pairs:
        for permutation in permutations(pair):
            if permutation[0].__class__ is A:
                my_a = permutation[0]
                if permutation[1].__class__ is X:
                    my_x = permutation[1]
                    if my_a not in result:
                        result[my_a] = {}
                    result[my_a]["key for X"] = my_x
    return result
I still have quintuple nesting because I added a check whether the key already exists (so my original drafts would have had sextuple nesting and two productive lines of code), but I got rid of the double iteration over the same iterable, and I have both minimal index usage and the possibility of working with triplets in the future.
One could avoid the assignments, but I prefer "my_a" over permutation[0].
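A minimal check of the sketch above, with hypothetical stand-in classes just rich enough for the class tests:

class A(object): pass
class X(object): pass

a1, x1 = A(), X()
print(final_draft([(x1, a1)]))  # {<A object>: {'key for X': <X object>}}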

Python user defined sort

I am trying to implement a user-defined sort function, similar to Python's list sort, as in list.sort(cmp=None, key=None, reverse=False).
Here is my code so far
from operator import itemgetter

class Sort:
    def __init__(self, sList, key=itemgetter(0), reverse=False):
        self._sList = sList
        self._key = key
        self._reverse = reverse
        self.sort()

    def sort(self):
        for index1 in range(len(self._sList) - 1):
            for index2 in range(index1, len(self._sList)):
                if self._reverse:
                    if self._sList[index1] < self._sList[index2]:
                        self._sList[index1], self._sList[index2] = \
                            self._sList[index2], self._sList[index1]
                else:
                    if self._sList[index1] > self._sList[index2]:
                        self._sList[index1], self._sList[index2] = \
                            self._sList[index2], self._sList[index1]

List = [[1, 2], [3, 5], [5, 1]]
Sort(List, reverse=True)
print List
I'm having a really hard time when it comes to the key parameter.
More specifically, I would like to know if there is a way to code a list with optional indexes (similar to foo(*parameters)).
I really hope you understand my question.
key is a function that converts an item to the criterion used for comparison.
Called with the item as its sole parameter, it returns a comparable value of your choice.
One classic key example, for integers stored as strings, is:
lambda x : int(x)
so strings are sorted numerically.
In your algorithm, you would have to replace
self._sList[index1] < self._sList[index2]
by
self._key(self._sList[index1]) < self._key(self._sList[index2])
so that the values computed from the items are compared, rather than the items themselves.
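Applied to your sort method, the change looks like this (a sketch, keeping your swap-based algorithm and only wiring in the key):

def sort(self):
    for index1 in range(len(self._sList) - 1):
        for index2 in range(index1, len(self._sList)):
            key1 = self._key(self._sList[index1])
            key2 = self._key(self._sList[index2])
            # swap on > for ascending order, on < when reverse=True
            out_of_order = key1 < key2 if self._reverse else key1 > key2
            if out_of_order:
                self._sList[index1], self._sList[index2] = \
                    self._sList[index2], self._sList[index1]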
Note that Python 3 dropped the cmp parameter and kept only key.
Also note that in your case, using itemgetter(0) as the key function works for subscriptable items such as lists (sorting by the first item only) or strings (sorting by the first character only).

Compare objects from more than 2 lists

Is there a way to compare all 2-item combinations of more than 2 lists?
Let's say there is an object:
import random

class obj():
    def __init__(self):
        self.name = None  # some name
        self.number = random.randrange(10)  # assuming a random int was intended
    def equals(self, other):
        if self.number == other.number:
            return True
        else:
            return False
list1, list2, list3, ..., listX - all these lists contain instances of class obj.
I want to compare all 2-item combinations from these lists and return the equal objects.
So if there is an obj in list2 whose obj.number attribute is 5 and an obj in list8 whose obj.number is 5, they will be returned.
For two lists the comparison would be simple:
for obj1 in list1:
    for obj2 in list2:
        if obj1.equals(obj2):
            print obj1, obj2
But I don't know how to make this comparison for more lists of objects.
Do you have any advice?
As you might know, with X lists the time complexity goes up to O(n^X), which is far from optimal (assuming all lists have the same length n).
Now it all depends on what you actually want as output. It seems to me that you want to find objects that are present in multiple lists.
One way to do this more efficiently is to use a dictionary (hashmap) and iterate through every list, hashing objects based on their self.number.
This will result in something like: {1: [obj1], 2: [obj2, obj3], 3: [obj4], ...}, where the keys are the numbers of the objects and the values are the objects that have these values as number.
By running over this dictionary and only considering entries that have a list with a size larger or equal than 2, you will end up with the objects that are equal.
Here the time complexity is O(n*X), which for a fixed number of lists is ~O(n).
To illustrate this, I've created a short simple example that uses 2 lists:
from collections import defaultdict

class Obj():
    def __init__(self, value):
        self.number = value

def find_equals(list1, list2):
    d = defaultdict(list)
    for obj1 in list1:
        d[obj1.number].append(obj1)
    for obj2 in list2:
        d[obj2.number].append(obj2)
    return [d[i] for i in d if len(d[i]) >= 2]

def test():
    l1 = [Obj(1), Obj(2), Obj(3), Obj(4)]
    l2 = [Obj(5), Obj(2), Obj(3), Obj(6)]
    print find_equals(l1, l2)

test()
It can probably be optimised with nifty python constructs, but it shows the idea behind it.
The output is:
[[<__main__.Obj instance at 0x103278440>, <__main__.Obj instance at 0x103278560>], [<__main__.Obj instance at 0x103278488>, <__main__.Obj instance at 0x1032785a8>]]
Which are the objects with the numbers 2 and 3, that were used in the test sample.
A (very) simple approach would be to get the intersection of the lists of objects.
To do that, you have to make your objects hashable (and comparable for equality, since sets use both) in order to build a set from each list of objects.
def __hash__(self):
    return self.number

# sets also compare members with ==, so __eq__ must agree with the hash
def __eq__(self, other):
    return self.number == other.number
Then, to check multiple lists, you simply take the set intersection:
x = [Obj(1), Obj(3), Obj(8), Obj(10), Obj(3)]
y = [Obj(2), Obj(9), Obj(10), Obj(3)]
intersection = set(x) & set(y)  # -> returns {Obj(3), Obj(10)}
This implementation has worst-case complexity O((n - 1) * L), where L is the maximum of the set lengths and n is the number of sets.
So, in terms of complexity, I think DJanssens's answer is faster.
But if performance is not the problem (e.g. you have small lists etc.), I think it's way more elegant to be able to write:
def intersect(*lists):
    return set.intersection(*map(set, lists))
or the same thing in lambda notation:
intersect = lambda *lists: set.intersection(*map(set, lists))
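Hedged usage of the helper, reusing x and y from above together with a third list:

common = intersect(x, y, [Obj(10), Obj(7)])  # -> {Obj(10)}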

delta-dictionary/dictionary with revision awareness in python?

I am looking to create a dictionary with 'roll-back' capabilities in Python. The dictionary would start with a revision number of 0, and the revision would be bumped up only by explicit method call. I do not need to delete keys, only add and update key-value pairs, and then roll back. I will never need to 'roll forward'; that is, when rolling the dictionary back, all the newer revisions can be discarded, and I can start re-revving up again. Thus I want behaviour like:
>>> rr = rev_dictionary()
>>> rr.rev
0
>>> rr["a"] = 17
>>> rr[('b',23)] = 'foo'
>>> rr["a"]
17
>>> rr.rev
0
>>> rr.roll_rev()
>>> rr.rev
1
>>> rr["a"]
17
>>> rr["a"] = 0
>>> rr["a"]
0
>>> rr[('b',23)]
'foo'
>>> rr.roll_to(0)
>>> rr.rev
0
>>> rr["a"]
17
>>> rr.roll_to(1)
Exception ...
Just to be clear, the state associated with a revision is the state of the dictionary just prior to the roll_rev() method call. Thus I can alter the value associated with a key several times 'within' a revision, and only have the last one remembered.
I would like a fairly memory-efficient implementation of this: the memory usage should be proportional to the deltas, so simply keeping a list of copies of the dictionary will not scale for my problem. One should assume the keys number in the tens of thousands and the revisions in the hundreds of thousands.
We can assume the values are immutable, but they need not be numeric. For the case where the values are e.g. integers, there is a fairly straightforward implementation (keep a list of dictionaries of the numerical deltas from revision to revision). I am not sure how to turn this into the general form. Maybe bootstrap the integer version and add on an array of values?
All help appreciated.
Have just one dictionary, mapping each key to a list of (revision_number, actual_value) tuples. The current value is the_dict[akey][-1][1]. Rollback merely involves popping the appropriate entries off the end of each list.
Update: examples of rollback
key1 -> [(10, 'v1-10'), (20, 'v1-20')]
Scenario 1: current revision is 30, rollback to 25: nothing happens
Scenario 2: current 30, back to 15: pop last entry
Scenario 3: current 30, back to 5: pop both entries
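Concretely, scenario 2 as a sketch of the pop-based rollback on that one list:

rlist = [(10, 'v1-10'), (20, 'v1-20')]
target_revno = 15
while rlist and rlist[-1][0] > target_revno:
    rlist.pop()
# rlist is now [(10, 'v1-10')]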
Update 2: faster rollback (with trade-offs)
I think your concern about popping every list is better expressed as "needs to examine every list to see if it needs popping". With a fancier data structure (more memory, more time to maintain the fancy bits in add and update operations) you can reduce the time to roll back.
Add an array (indexed by revision number) whose values are lists of the dictionary values that were changed in that revision.
# Original rollback code:
for rlist in the_dict.itervalues():
    if not rlist: continue
    while rlist and rlist[-1][0] > target_revno:
        rlist.pop()

# New rollback code:
for revno in xrange(current_revno, target_revno, -1):
    for rlist in delta_index[revno]:
        assert rlist[-1][0] == revno
        del rlist[-1]  # faster than rlist.pop()
del delta_index[target_revno+1:]
Update 3: full code for fancier method
import collections

class RevDict(collections.MutableMapping):
    def __init__(self):
        self.current_revno = 0
        self.dict = {}
        self.delta_index = [[]]

    def __setitem__(self, key, value):
        if key in self.dict:
            rlist = self.dict[key]
            last_revno = rlist[-1][0]
            rtup = (self.current_revno, value)
            if last_revno == self.current_revno:
                rlist[-1] = rtup
                # delta_index already has an entry for this rlist
            else:
                rlist.append(rtup)
                self.delta_index[self.current_revno].append(rlist)
        else:
            rlist = [(self.current_revno, value)]
            self.dict[key] = rlist
            self.delta_index[self.current_revno].append(rlist)

    def __getitem__(self, key):
        if key not in self.dict:
            raise KeyError(key)
        return self.dict[key][-1][1]

    def new_revision(self):
        self.current_revno += 1
        self.delta_index.append([])

    def roll_back(self, target_revno):
        assert 0 <= target_revno < self.current_revno
        for revno in xrange(self.current_revno, target_revno, -1):
            for rlist in self.delta_index[revno]:
                assert rlist[-1][0] == revno
                del rlist[-1]
        del self.delta_index[target_revno+1:]
        self.current_revno = target_revno

    def __delitem__(self, key):
        raise TypeError("RevDict doesn't do del")

    def keys(self):
        return self.dict.keys()

    def __contains__(self, key):
        return key in self.dict

    def iteritems(self):
        for key, rlist in self.dict.iteritems():
            yield key, rlist[-1][1]

    def __len__(self):
        return len(self.dict)

    def __iter__(self):
        return self.dict.iterkeys()
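A usage sketch against the asker's desired session (note this class uses new_revision and roll_back in place of roll_rev and roll_to):

rd = RevDict()
rd["a"] = 17
rd[("b", 23)] = "foo"
rd.new_revision()                # the revision 0 state is now frozen
rd["a"] = 0
assert rd["a"] == 0 and rd[("b", 23)] == "foo"
rd.roll_back(0)
assert rd["a"] == 17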
The deluxe solution would be to use B+Trees with copy-on-write. I used a variation on B+Trees to implement my blist data type (which can be used to very efficiently create revisions of lists, exactly analogous to your problem).
The general idea is to store the data in a balanced tree. When you create a new revision, you copy only the root node. If you need to modify a node shared with an older revision, you copy the node and modify the copy instead. That way, the old tree is still completely intact, but you only need memory for the changes (technically, O(k * log n) where k is the number of changes and n is the total number of items).
It's non-trivial to implement, though.
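Not the blist itself, but a toy path-copying sketch showing the copy-on-write idea on a plain binary search tree: each revision is just a root pointer, and unchanged subtrees are shared between revisions.

class Node(object):
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value = key, value
        self.left, self.right = left, right

def insert(root, key, value):
    # returns a NEW root; the old tree is never mutated
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, insert(root.right, key, value))
    return Node(key, value, root.left, root.right)  # same key: copy node with new value

rev0 = insert(insert(None, "a", 17), "b", "foo")
rev1 = insert(rev0, "a", 0)  # only the path to "a" is copied; rev0 is still intact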
