Python: Removing lower versions from a 2D list - python

Assume I have the following data structure (a list of lists):
myList = [['Something','1'], ['Something','2'], ['Something Else','5'], ['Yet Another Something','1'], ['Yet ANOTHER Something','0'], ['Yet Another Something','2']]
I have a function that will remove duplicates from that list, choosing the highest number for the 2nd value. However, it seems to choke on very large data sets (150+ entries in myList). For this small data set, I expect the returned list to be:
[['Something','2'], ['Something Else','5'], ['Yet Another Something','2']]
What kind of optimization can be implemented using standard python (without including custom, external modules) into this function so that it returns the same result set without issues on large data sets?
Here is my function:
def remove_duplicates(duplicate):
    final_list = []
    final_list_upper = []
    for k,v in duplicate:
        found = False
        for x in range(len(final_list)):
            if k in final_list[x] or k.upper() in final_list_upper[x]:
                if k == final_list[x][0] or k.upper() == final_list_upper[x][0]:
                    if int(v) >= int(final_list[x][1]):
                        final_list.pop(final_list.index(final_list[x]))
                        final_list_upper.pop(final_list_upper.index(final_list_upper[x]))
                        break
                    else:
                        found = True
                        break
        if not found:
            final_list.append([k,v])
            final_list_upper.append([k.upper(),v])
    final_list_upper = [] # clear the list
    return final_list

You're using a second loop to check if the current "key" that you're checking exists in the list. This is slowing down your code.
Why? Because, as your code demonstrates, checking for membership in lists is slow. Really slow, because you need to iterate over the entire list, which means it's an O(N) operation, so the time depends linearly on the size of the list.
Instead, you could simply change the list to a dictionary. Lookup in a dictionary is an O(1) operation, so the lookup happens in constant (or nearly constant) time regardless of the size of the dictionary.
When you do this, there's no longer a need for two loops. Here's an idea:
def remove_duplicates_new(duplicate):
    final_dict = {}
    case_sensitive_keys = {}
    for k, v in duplicate:
        klower = k.lower()
        vint = int(v)
        old_val = final_dict.get(klower) # None if the key hasn't been seen yet
        if old_val is None or vint > old_val:
            # Store on first sight, or replace if current value is greater than old value
            final_dict[klower] = vint
            case_sensitive_keys[klower] = k
    # Now we're done looping, so create the list
    final_list = [[case_sensitive_keys[k], str(v)] for k, v in final_dict.items()]
    return final_list
To compare, let's make a test list with 10000 elements. The "keys" are random numbers between 1 and 100, so we're bound to get a whole bunch of duplicates:
import random
import timeit
testList = [[str(random.randint(1, 100)), str(random.randint(1, 10))] for i in range(10000)]
timeit.timeit('remove_duplicates(testList)', setup='from __main__ import testList, remove_duplicates', number=10)
# Output: 1.1064800999999989
timeit.timeit('remove_duplicates_new(testList)', setup='from __main__ import testList, remove_duplicates_new', number=10)
# Output: 0.03743689999998878
Hot damn! That's a ~30x speedup!
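As a quick sanity check, the dictionary idea can be run against the small list from the question. The sketch below is a compact variant of the same approach, not the code timed above; the name `dedupe_keep_highest` is made up for illustration, and it assumes that on a tie the first spelling seen should win.

```python
def dedupe_keep_highest(pairs):
    """Keep one [name, value] pair per case-insensitive name: the highest value."""
    best = {}  # lowercased name -> (original spelling, int value)
    for name, value in pairs:
        key = name.lower()
        number = int(value)
        # Store the first occurrence, then replace only on a strictly higher value.
        if key not in best or number > best[key][1]:
            best[key] = (name, number)
    return [[name, str(number)] for name, number in best.values()]


my_list = [['Something', '1'], ['Something', '2'], ['Something Else', '5'],
           ['Yet Another Something', '1'], ['Yet ANOTHER Something', '0'],
           ['Yet Another Something', '2']]
print(dedupe_keep_highest(my_list))
# [['Something', '2'], ['Something Else', '5'], ['Yet Another Something', '2']]
```

The output order relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 on.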

Related

Fastest way to intersect and merge many dictionaries

I have 1000 dictionaries (grid_1, grid_2 ... grid_1000) stored as pickle objects (generated by some previous process) and one reference dictionary. I need to compare each of the pickled dictionaries to the reference, and then combine the results.
The input dictionaries might look like:
grid_1 = {'143741': {'467457':1,'501089':2,'903718':1,'999216':5,'1040952':2},'281092':{'1434': 67,'3345': 345}, '33123': {'4566':5,'56788':45}}
grid_2 = {'143741': {'467457':5,'501089':7,'1040952':9},'281092':{'1434': 67,'3345': 20}, '33123': {'4566':7,'56788':38}}
and the reference dictionary might look like:
grid_density_original = {'143741': {'467457':1,'501089':2,'903718':1,'999216':5,'9990':4},'281092':{'1434': 60,'3345': 3,'9991': 43}, '33123': {'56788':4}}
In the first step, we should intersect the individual grid_n dicts like:
# intersection of grid_1 and grid_density_original
assert intersect_1 == {'143741': {'467457':1,'501089':2,'903718':1,'999216':5},'281092':{'1434': 67,'3345': 345}, '33123': {'56788':45}}
# intersection of grid_2 and grid_density_original
assert intersect_2 == {'143741': {'467457':5,'501089':7},'281092':{'1434': 67,'3345': 20}, '33123': {'56788':38}}
Then these results should be combined, as follows:
assert combine12 == {'143741': {'467457':[1,5],'501089':[2,7],'903718':[1,99999],'999216':[5,99999]},'281092':{'1434': [67,67],'3345': [345,20]}, '33123': {'56788':[45,38]}}
This appears to be the slow part, as the inner list size increases each time a new intersect_n is added.
This is the code I have currently. My actual dictionaries have on the order of 10,000 keys, and this code takes about 4 days to run.
from collections import defaultdict
from collections import Counter
import pickle
import gc
import copy
import pickle
import scipy.stats as st
from collections import defaultdict
# grid_density_orignal is original nested dictionary we compare each of 1000 grids to:
with open('path\grid_density_original_intercountry.pickle','rb') as handle:
    grid_density_orignal = pickle.load(handle,encoding ='latin-1')
# Previous process generated 1000 grids and dump them as .pickle files : grid_1,grid_2....grid_1000
for iteration in range(1,1001):
    # load each grid i.e.grid_1,grid_2...grid_1000 into memory sequentially
    filename = 'path\grid_%s' %iteration
    with open(filename,'rb') as handle:
        globals()['dictt_%s' % iteration] = pickle.load(handle,encoding ='latin-1')
    # Counter to store grid-grids densities: same dictionary structure as grid_density_orignal
    globals()['g_den_%s' % iteration] = defaultdict(list)
    for k,v in globals()['dictt_%s' % iteration].items():
        globals()['g_den_%s' % iteration][k] = Counter(v)
    # here we find the common grid-grid connections between grid_density_orignal and each of the 1000 grids
    globals()['intersect_%s' % iteration] = defaultdict(list)
    for k,v in grid_density_orignal.items():
        pergrid = defaultdict(list)
        common_grid_ids = v.keys() & globals()['g_den_%s' % iteration][k].keys()
        for gridid in common_grid_ids:
            pergrid[gridid] = globals()['g_den_%s' % iteration][k][gridid]
        globals()['intersect_%s' % iteration][k] = pergrid
print('All 1000 grids intersection done')
# From previous code we now have 1000 intersection grids : intersect_1,intersect_2 ...... intersect_1000
for iteration in range(1,1000):
    itnext = iteration +1 # to get next intersect out of 1000
    globals()['combine_%s%s' %(iteration,itnext)] = defaultdict(list) # dictionary to store intermediate combine step results between 2 intersects : intersect_x and intersect_x+1
    for k,v in globals()['intersect_%s' %iteration].items():
        innr = []
        combine = defaultdict(list)
        for key in set(list(globals()['intersect_%s' % iteration][k].keys())+ list(globals()['intersect_%s' % itnext][k].keys())): # key in the union of intersect_1 , intersect_2
            if (isinstance(globals()['intersect_%s' % iteration][k].get(key,99999), int) and isinstance(globals()['intersect_%s' % itnext][k].get(key,99999), int)): # get key value if exists, if for e.g. a certain grid doesnt exist in intersect_1, intersect_2 we give it default of 99999 as placeholder, alos check if value is an instance of int or list as in intial step it is an int but later we get lists after combining every 2 intersects
                combine[key] = [globals()['intersect_%s' % iteration][k].get(key,99999)] + [globals()['intersect_%s' % itnext][k].get(key,99999)] # combine into list intersect_1, intersect_2
            if (isinstance(globals()['intersect_%s' % iteration][k].get(key,99999), list) and isinstance(globals()['intersect_%s' % itnext][k].get(key,99999), int)): # this condition will be reached after initial step of intersect_1 + intersect_2
                combine[key] = globals()['intersect_%s' % iteration][k].get(key,99999) + [globals()['intersect_%s' % itnext][k].get(key,99999)] # combine into list intersect_1, intersect_2
        globals()['combine_%s%s' %(iteration,itnext)][k] = combine
    globals()['intersect_%s' % itnext] = copy.copy(globals()['combine_%s%s' %(iteration,itnext)]) # copy combine dict onto intersect dict so we can continue combining this dict with the next iteration
    print('2 combined: ',iteration,itnext)
    del globals()['intersect_%s' % iteration] # delete older intersect, combine as we dont need them and may cause memory overflow when more dicts are in memory
    del globals()['combine_%s%s' %(iteration,itnext)]
    gc.collect() # explicitly call the garbage collector as too big for ram
print('intersection and combine done') # at the end we have intersect_1000 with is a dict with all grid_id ids as keys and a list of densities (list size is 1000 corresponding to 1000 grids)
How can I improve the performance of the code?
I will focus on the second loop (merging the intersect_n dicts), give some general advice, and leave the rest as an exercise. My final result is so much shorter than the original that I feel the need to break down the process into several steps. Hopefully you will learn many useful techniques along the way.
After removing blank lines, comments (we'll rewrite those later anyway) and debug traces, and cleaning up a bit of whitespace we are starting with:
for iteration in range(1, 1000):
    itnext = iteration + 1
    globals()['combine_%s%s' % (iteration, itnext)] = defaultdict(list)
    for k, v in globals()['intersect_%s' % iteration].items():
        innr = []
        combine = defaultdict(list)
        for key in set(list(globals()['intersect_%s' % iteration][k].keys()) + list(globals()['intersect_%s' % itnext][k].keys())):
            if (isinstance(globals()['intersect_%s' % iteration][k].get(key, 99999), int) and isinstance(globals()['intersect_%s' % itnext][k].get(key, 99999), int)):
                combine[key] = [globals()['intersect_%s' % iteration][k].get(key, 99999)] + [globals()['intersect_%s' % itnext][k].get(key, 99999)]
            if (isinstance(globals()['intersect_%s' % iteration][k].get(key, 99999), list) and isinstance(globals()['intersect_%s' % itnext][k].get(key, 99999), int)):
                combine[key] = globals()['intersect_%s' % iteration][k].get(key, 99999) + [globals()['intersect_%s' % itnext][k].get(key, 99999)]
        globals()['combine_%s%s' % (iteration, itnext)][k] = combine
    globals()['intersect_%s' % itnext] = copy.copy(globals()['combine_%s%s' % (iteration, itnext)])
    del globals()['intersect_%s' % iteration]
    del globals()['combine_%s%s' % (iteration, itnext)]
    gc.collect()
Now we can get to work.
1. Properly structuring the data
Trying to create variable variables is generally a bad idea. It also has a performance cost:
$ python -m timeit -s "global x_0; x_0 = 'test'" "globals()['x_%s' % 0]"
2000000 loops, best of 5: 193 nsec per loop
$ python -m timeit -s "global x; x = ['test']" "x[0]"
10000000 loops, best of 5: 29.1 nsec per loop
Yes, we're talking about nanoseconds, but the existing code is doing it constantly, for nearly every access. But more importantly, visually simplified code is easier to analyze for subsequent improvements.
Clearly we already know how to manipulate nested data structures; adding one more level of nesting isn't an issue. To store the intersect_n results, rather than having 1000 dynamically named variables, the obvious solution is to just make a 1000-element list, where each element is one of those results. (Note that we will start counting them from 0 rather than 1, of course.) As for globals()['combine_%s%s' % (iteration, itnext)] - that makes no sense; we don't need to create a new variable name each time through, because we're going to throw that data away at the end of the loop anyway. So let's just use a constant name.
Once the first loop is modified to give the right data (which will also look much simpler in that part), the access looks much simpler here:
for iteration in range(999):
    itnext = iteration + 1
    combine_overall = defaultdict(list)
    for k, v in intersect[iteration].items():
        combine = defaultdict(list)
        for key in set(list(intersect[iteration][k].keys()) + list(intersect[itnext][k].keys())):
            if (isinstance(intersect[iteration][k].get(key, 99999), int) and isinstance(intersect[itnext][k].get(key, 99999), int)):
                combine[key] = [intersect[iteration][k].get(key, 99999)] + [intersect[itnext][k].get(key, 99999)]
            if (isinstance(intersect[iteration][k].get(key, 99999), list) and isinstance(intersect[itnext][k].get(key, 99999), int)):
                combine[key] = intersect[iteration][k].get(key, 99999) + [intersect[itnext][k].get(key, 99999)]
        combine_overall[k] = combine
    intersect[itnext] = copy.copy(combine_overall)
You'll notice I removed the memory management stuff at the end. I'll discuss a better approach for that later. The del for the iteration value would mess up iterating over that list, and we don't need to delete combine_overall because we'll just replace it with a new empty defaultdict. I also sneakily removed innr = [], because the value is simply unused. Like I said: visually simpler code is easier to analyze.
2. Unnecessary type checks
All this isinstance stuff is hard to read, and time consuming especially considering all the repeated access:
$ python -m timeit -s "global x; x = {'a': {'b': {'c': 0}}}" "isinstance(x['a']['b'].get('c', 0), int)"
2000000 loops, best of 5: 138 nsec per loop
$ python -m timeit -s "global x; x = {'a': {'b': {'c': 0}}}" "x['a']['b'].get('c', 0)"
5000000 loops, best of 5: 83.9 nsec per loop
We know the exact conditions under which intersect[itnext][k].get(key, 99999) should be an int: always, or else the data is simply corrupted. (We can worry about that later, and probably by doing exception handling in the calling code.) We know the conditions under which intersect[iteration][k].get(key, 99999) should be an int or a list: it will be an int (or missing) the first time through, and a list (or missing) every other time. Fixing this will also make it easier to understand the next step.
for iteration in range(999):
    itnext = iteration + 1
    combine_overall = defaultdict(list)
    for k, v in intersect[iteration].items():
        combine = defaultdict(list)
        for key in set(list(intersect[iteration][k].keys()) + list(intersect[itnext][k].keys())):
            if iteration == 0:
                combine[key] = [intersect[iteration][k].get(key, 99999)] + [intersect[itnext][k].get(key, 99999)]
            else:
                combine[key] = intersect[iteration][k].get(key, [99999]) + [intersect[itnext][k].get(key, 99999)]
        combine_overall[k] = combine
    intersect[itnext] = copy.copy(combine_overall)
Notice how, when the key is either a list or missing, we use a list as the default value. That's the trick to preserving type consistency and making it possible to write the code this way.
3. An unnecessary copy and unnecessary pair-wise iteration
Since combine_overall isn't referenced anywhere else, we don't actually need to copy it over the intersect[itnext] value - we could just reassign it without any aliasing issues. But better yet is to just leave it where it is. Instead of considering adjacent pairs of iteration values that we merge together pairwise, we just merge everything into combine_overall, one at a time (and set up an initial defaultdict once instead of overwriting it).
This does mean we'll have to do some setup work - instead of special-casing the first merge, we'll "merge" intersect[0] by itself into the initial state of combine_overall.
combine_overall = defaultdict(list)
for k, v in intersect[0].items():
    combine = defaultdict(list)
    for key, value in v.items():
        combine[key] = [value]
    combine_overall[k] = combine
for iteration in range(999):
    itnext = iteration + 1
    for k, v in combine_overall.items():
        combine = defaultdict(list)
        for key in set(list(combine_overall[k].keys()) + list(intersect[itnext][k].keys())):
            combine[key] = combine_overall[k].get(key, [99999]) + [intersect[itnext][k].get(key, 99999)]
        combine_overall[k] = combine
Notice how much more simply we can do the initial step - we know which keys we're working with, so no .gets are necessary; and there's only one dict, so no merging of key-sets is necessary. But we aren't done...
4. Some miscellaneous cleanup
Looking at this version, we can more easily notice:
The iteration loop doesn't use iteration at all, but only itnext - so we can fix that. Also, there's no reason to use indexes like this for a simple loop - we should directly iterate over elements.
combine_overall will hold dicts, not lists (as we assign the values from combine, which is a defaultdict); so defaultdict(list) makes no sense.
Instead of using a temporary combine to build a replacement for combine_overall[k] and then assigning it back, we could just directly modify combine_overall[k]. In this way, we would actually get benefit from the defaultdict behaviour. We actually want the default values to be defaultdicts themselves - not completely straightforward, but very doable.
Since we no longer need to make a distinction between the overall combined result and individual results, we can rename combine_overall to just combine to look a little cleaner.
combine = defaultdict(lambda: defaultdict(list))
for k, v in intersect[0].items():
    for key, value in v.items():
        combine[k][key] = [value]
for to_merge in intersect[1:]:
    for k, v in combine.items():
        for key in set(list(combine[k].keys()) + list(to_merge[k].keys())):
            combine[k][key] = combine[k].get(key, [99999]) + [to_merge[k].get(key, 99999)]
5. Oops, there was a bug all along. Also, "special cases aren't special enough to break the rules"
Hopefully, this looks a little strange to you. Why are we using .get on a defaultdict? Why would we have this single-item placeholder value, rather than an empty list? Why do we have to do this complicated check for the possible keys to use? And do we really need to handle the first intersect value differently?
Consider what happens on the following data (using the original naming conventions):
intersect_1 = {'1': {'1': 1}}
intersect_2 = {'1': {'1': 1}}
intersect_3 = {'1': {'1': 1, '2': 1}}
With the original approach, I get a result like:
$ python -i tmp.py
>>> intersect_3
defaultdict(<class 'list'>, {'1': defaultdict(<class 'list'>, {'2': [99999, 1], '1': [1, 1, 1]})})
Oops. intersect_3['1']['2'] only has two elements ([99999, 1]), and thus doesn't match up with intersect_3['1']['1']. That defeats the purpose of the 99999 placeholder values. The problem is that, because the value was missing multiple times at the start, multiple 99999 values should have been inserted, but only one was - the one that came from creating the initial list, when the isinstance check reported an int rather than a list, when it retrieved the 99999 with .get. That lost information: we couldn't distinguish between a missing int and a missing list.
How do we work around this? Simple: we use the same, overall key set each time - the total set of keys that should be present, which we get from grid_density_original[k]. Whenever one of those entries is missing in any of the intersect results, we write the placeholder value instead. Now we are also handling each intersect result the same way - instead of doing special setup with the first value, and merging everything else in, we are merging everything in to an empty initial state.
Instead of iterating over the .items of combine (and expecting to_merge to have the same keys), we iterate over to_merge, which makes a lot more sense. Instead of creating and assigning a list for combine[k][key], we simply append a value to the existing list (and we know there is one available, because we are using defaultdicts properly now).
Thus:
combine = defaultdict(lambda: defaultdict(list))
for to_merge in intersect:
    for k, v in to_merge.items():
        # BTW, you typo'd this as "orignal" in the original code
        for key in grid_density_original[k].keys():
            combine[k][key].append(v.get(key, 99999))
(This does mean that, if none of the intersect dicts contain a particular key, the result will contain a list of 1000 99999 values, instead of omitting the key completely. I hope that isn't an issue.)
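To see that this keeps the lists aligned, here is a small runnable check on the three-grid example from step 5. The reference dict is not given for that example, so the `grid_density_original` used here is a hypothetical one chosen to contain both keys; `PLACEHOLDER` is simply a name for the 99999 value.

```python
from collections import defaultdict

PLACEHOLDER = 99999  # same placeholder value as in the question

# Hypothetical reference dict for the three-grid example (an assumption).
grid_density_original = {'1': {'1': 1, '2': 1}}
intersect = [
    {'1': {'1': 1}},
    {'1': {'1': 1}},
    {'1': {'1': 1, '2': 1}},
]

combine = defaultdict(lambda: defaultdict(list))
for to_merge in intersect:
    for k, v in to_merge.items():
        # Always walk the reference key set, so every list grows by one per grid.
        for key in grid_density_original[k]:
            combine[k][key].append(v.get(key, PLACEHOLDER))

print(dict(combine['1']))
# {'1': [1, 1, 1], '2': [99999, 99999, 1]}
```

Both inner lists now have three entries, one per grid, with the placeholder filling every position where the key was missing.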
6. Okay, but weren't you going to do something about the memory usage?
Oh, right. Take a moment to write the corresponding code for the first loop, please.
Got it? Okay, now. Set up combine first; and then each time you compute one of the would-be elements of intersect, merge it in (using the two inner loops shown here) instead of building the actual intersect list.
Oh, and I think you'll find that - since we're going to iterate over grid_density_original[k].keys() anyway - the preprocessing to remove other keys from the g_den results isn't actually needed at all now.
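One possible shape for that exercise is sketched below; treat it as an illustration under stated assumptions rather than the intended solution. The names `load_grids` and `merge_grids` are made up, the file paths are whatever you pass in, and, unlike the loop above, this version iterates over the outer keys of the reference dict, so a grid that lacks an outer key contributes placeholders instead of being skipped.

```python
import pickle
from collections import defaultdict

PLACEHOLDER = 99999


def load_grids(paths):
    """Yield one unpickled grid dict at a time, so only one is held in memory."""
    for path in paths:
        with open(path, 'rb') as handle:
            yield pickle.load(handle, encoding='latin-1')


def merge_grids(reference, grids, placeholder=PLACEHOLDER):
    """Append one value per grid to every (k, key) slot named by the reference."""
    combine = defaultdict(lambda: defaultdict(list))
    for grid in grids:
        for k, ref_inner in reference.items():
            inner = grid.get(k, {})
            for key in ref_inner:
                combine[k][key].append(inner.get(key, placeholder))
    return combine


# In-memory demo with cut-down versions of the dicts from the question.
reference = {'143741': {'467457': 1, '9990': 4}, '33123': {'56788': 4}}
grid_1 = {'143741': {'467457': 1, '1040952': 2}, '33123': {'4566': 5, '56788': 45}}
grid_2 = {'143741': {'467457': 5}, '33123': {'4566': 7, '56788': 38}}
result = merge_grids(reference, [grid_1, grid_2])
print(result['143741']['467457'], result['33123']['56788'])
```

With files on disk, the call would look like `merge_grids(grid_density_original, load_grids(paths))`, where `paths` is a list of pickle file names. Because `load_grids` is a generator, each grid can be garbage-collected as soon as it has been merged.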

What is the time complexity of dictionary on all operation with tuple value as its key in python?

I am learning Python and wondering what the time complexity of each dictionary operation (Copy, Get Item, Set Item, Delete Item, Iteration) would be if I use tuple values as keys in the dictionary. Can someone please shed some light?
Here's the sample code.
keys = ("Name", "age", "height")
dog = {keys[0]: "labrador", keys[1]: 2, keys[2]: 3}
Based on this link:
Except for copy and iteration, dict operations are O(1) on average, with an O(n) worst case.
For copy and iteration, it's logical that the cost is O(n), as there are n elements, each of which you can access in O(1).
As commented by Laurent, every dictionary is "stored" based on the hash of the key (link). In some cases, it can happen that 2 keys have the same hash (a collision). In such cases, I imagine, it ends up in the "Amortized Worst Case", as it has to compare the keys themselves and not just the hashes. You can also find more information about how dictionaries are built in this topic.
In addition, for copy and iteration, you should consider the largest n ever reached in the dictionary. If you fill in 10000 entries, then remove 9990 and do a copy, you should consider 10000 elements and not 10.
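The effect of collisions can be made visible without timing anything, by counting equality comparisons. The sketch below uses a made-up key class, `CollidingKey`, whose hash is deliberately constant; it is an illustration of the worst case, not something a real program should do. It also shows that an ordinary tuple of hashable items works as a key.

```python
class CollidingKey:
    """Hypothetical key type whose hash is constant, so every key collides."""
    eq_calls = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 1  # deliberately terrible: all instances share one hash

    def __eq__(self, other):
        CollidingKey.eq_calls += 1
        return isinstance(other, CollidingKey) and self.value == other.value


def count_comparisons(n):
    """Insert n colliding keys and return how many == comparisons were needed."""
    CollidingKey.eq_calls = 0
    table = {}
    for i in range(n):
        table[CollidingKey(i)] = i
    return CollidingKey.eq_calls


# Tuples of hashable items work as keys; their hash is derived from the items.
dog = {("Name", "first"): "labrador", ("age", "years"): 2}
print(dog[("age", "years")])  # 2

small, large = count_comparisons(100), count_comparisons(200)
print(small, large)  # comparisons grow much faster than the number of keys
```

With well-distributed hashes, such as those of tuples of distinct strings or integers, most insertions and lookups need few or no equality comparisons, which is where the average O(1) comes from.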
EDIT 1: Test times based on key type
You can test the time based on key type with the following code:
import timeit
def f1():
    a = {}
    for i in range(1000):
        a[i] = i
    return a
def f2():
    a = {}
    for i in range(1000):
        a[str(i)] = i
    return a
def f3():
    a = {}
    for i in range(1000):
        a[(i, i)] = i
    return a
def f4():
    a = {}
    for i in range(1000):
        a[(str(i), str(i))] = i
    return a
print(timeit.timeit("f1()", number = 1000, setup="from __main__ import f1")) # => 0.0541156338055877
print(timeit.timeit("f2()", number = 1000, setup="from __main__ import f2")) # => 0.2492961917066741
print(timeit.timeit("f3()", number = 1000, setup="from __main__ import f3")) # => 0.09082684668080204
print(timeit.timeit("f4()", number = 1000, setup="from __main__ import f4")) # => 0.4479192082498416
Unfortunately, you cannot compare them one to one, as f2 and f4 call str(), which slows those functions down, but you can at least see that a tuple key is slower than a simple string or integer key. Nevertheless, you can estimate the time spent converting to str with the following code:
def f5():
    for i in range(1000):
        str(i)
print(timeit.timeit("f5()", number = 1000, setup="from __main__ import f5"))
This evaluates to 0.1645s. Subtracting this time once from f2 and twice from f4 gives:
0.08614475546254619s for f2
0.12553885821459693s for f4
This gives you some idea of the time needed to create a dict depending on the key style. You can do the same to test access time, but if the hashes don't collide, the time should be the same no matter the key style.
You can also take a look to this video I found tonight
I hope it helps,

Replace for with list comprehension

In this script I have used both a list comprehension and a for loop. I need to replace the for loop with a comprehension and fold its work into the existing list comprehension.
How can I add
for i in k:
    count_list.append(l.count(i))
inside this block
pairs = [int(pair/2) for pair in count_list if int(pair/2) != 0]
My code:
def sockMerchant(ar):
    l = ar
    k = set(l)
    count_list = []
    for i in k:
        count_list.append(l.count(i))
    pairs = [int(pair/2) for pair in count_list if int(pair/2) != 0]
    return sum(pairs)
n = int(input().strip())
ar = list(map(int, input().strip().split(' ')))
result = sockMerchant(ar)
print(result)
You should not be using a list comprehension at all, nor the for loop you have now. The loop is inefficient; by using list.count() you are traversing the whole list l for every unique value, creating a O(N^2) loop.
Use a collections.Counter() object instead and count in O(N) time:
from collections import Counter
def sockMerchant(ar):
    counts = Counter(ar)
    return sum(count//2 for count in counts.values())
or even
def sockMerchant(ar):
    return sum(count//2 for count in Counter(ar).values())
if you insist on a single line.
Note that sum() doesn't mind a few 0 values here and there, so I removed the if test for single 'socks'. Also, I used the // floor division operator rather than turning the floating point result of dividing by 2 back into an integer.
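For comparison, here is a small runnable sketch that puts the literal "everything in one comprehension" version next to the Counter version and checks that they agree. The function names and the sample input are made up for illustration; the first function still rescans the list for every distinct value, so it keeps the O(N^2) behaviour described above.

```python
from collections import Counter


def sock_merchant_quadratic(ar):
    """Literal single-expression version: list.count() rescans ar per value."""
    return sum(ar.count(value) // 2 for value in set(ar))


def sock_merchant(ar):
    """Counter version: one pass to count, then count // 2 pairs per colour."""
    return sum(count // 2 for count in Counter(ar).values())


socks = [10, 20, 20, 10, 10, 30, 50, 10, 20]  # made-up sample input
print(sock_merchant_quadratic(socks), sock_merchant(socks))  # 3 3
```

Here colour 10 appears four times (two pairs), colour 20 three times (one pair), and colours 30 and 50 once each (no pairs).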

Cython dictionary / map

I have a list of element, label pairs like this: [(e1, l1), (e2, l2), (e3, l1)]
I have to count how many labels two elements have in common, i.e. in the list above e1 and e3 have the label l1 in common and thus 1 label in common.
I have this Python implementation:
from collections import defaultdict
def common_count(e_l_list):
    count = defaultdict(int)
    l_list = defaultdict(set)
    for e1, l in e_l_list:
        for e2 in l_list[l]:
            if e1 == e2:
                continue
            elif e1 > e2:
                count[e1,e2] += 1
            else:
                count[e2,e1] += 1
        l_list[l].add(e1)
    return count
It takes a list like the one above and computes a dictionary of element pairs and counts. The result for the list above should be {(e3, e1): 1} (assuming e3 > e1, since the larger element comes first in the key).
Now I have to scale this to millions of elements and labels, and I thought Cython would be a good solution to save CPU time and memory. But I can't find docs on how to use maps in Cython.
How would I implement the above in pure Cython?
It can be assumed that all elements and labels are unsigned integers.
Thanks in advance :-)
I think you are overcomplicating this by creating pairs of elements and storing all common labels as the value. Instead, you can create a dict with the element as the key and a list of all labels associated with that element as the value. When you want to find the common labels of two elements, convert their lists to sets and intersect them; the resulting set holds the labels the two have in common. The average time of the intersection, checked with ~20000 lists, is roughly 0.006, which is very fast.
I tested this with this code
from collections import *
import random
import time
l =[]
for i in xrange(10000000):
    #With element range 0-10000000 the dictionary creation time increases to ~16 seconds
    l.append((random.randrange(0,50000),random.randrange(0,50000)))
start = time.clock()
d = defaultdict(list)
for i in l: #O(n)
    d[i[0]].append(i[1]) #O(n)
print time.clock()-start
times = []
for i in xrange(10000):
    start = time.clock()
    tmp = set(d[random.randrange(0,50000)]) #picks a random list of labels
    tmp2 = set(d[random.randrange(0,50000)]) #not guaranteed to be a different list but more than likely
    times.append(time.clock()-start)
    common_elements = tmp.intersection(tmp2)
print sum(times)/100.0
18.6747529999 #creation of list
4.17812619876 #creation of dictionary
0.00633531142994 #intersection
Note: The times do change slightly depending on number of labels. Also creating the dict might be too long for your situation but that is only a one time operation.
I would also strongly advise against creating all pairs of elements. If you have 5,000,000 elements and they all share at least one label, which is the worst case, then you are looking at about 1.25e+13 pairs or, more bluntly, 12.5 trillion. That would be ~1700 terabytes or ~1.7 petabytes.
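A compact Python 3 sketch of the element-to-labels idea follows. It stores sets directly instead of converting lists on demand, which is a small variation on the approach described above; the function names and the sample pairs are made up for illustration. The last line reproduces the pair-count arithmetic for 5,000,000 elements.

```python
import math
from collections import defaultdict


def labels_by_element(pairs):
    """Map each element to the set of labels attached to it."""
    labels = defaultdict(set)
    for element, label in pairs:
        labels[element].add(label)
    return labels


def common_label_count(labels, a, b):
    """Number of labels shared by elements a and b (0 if either is unknown)."""
    return len(labels.get(a, set()) & labels.get(b, set()))


pairs = [(1, 10), (2, 20), (3, 10), (3, 20)]  # made-up (element, label) pairs
labels = labels_by_element(pairs)
print(common_label_count(labels, 1, 3))  # 1
print(common_label_count(labels, 1, 2))  # 0

# Number of unordered pairs among 5,000,000 elements: n * (n - 1) / 2.
print(math.comb(5_000_000, 2))  # 12499997500000
```

The counts are computed only for the pairs you actually ask about, so memory grows with the number of (element, label) entries rather than with the number of element pairs.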

Fastest way to sort list of dates when adding to the list

I am writing something that is similar to a task scheduler. I have two sets of tasks, some which are fixed (they are given a start and end date and time) and some which are not fixed (they are given a start date and time and a duration).
The non-fixed tasks are influenced by the fixed tasks, so that if a non-fixed task is overlapped by a fixed task, the non-fixed task will extend its duration by the amount of overlap.
I start with a list of tuples where the first item is the starting date and the second item is the ID for that fixed task, like this:
[(2012-04-30, 1), (2012-05-01, 5), (2012-05-04, 2)]
I then have another list, which is ordered by the user, of the non-fixed tasks. The idea is that I'll loop through this list, and inside of that loop I'll loop through the first list to find the tasks that could overlap with this task, and figure out how much to extend the non-fixed task.
Here is where I'm asking for your help. Now that I know the calculated start and end times of this non-fixed task, I need to consider it "fixed" so that it influences the rest of the non-fixed tasks.
I can add this task to the first list of fixed tasks and sort it again, but that means that I'm sorting the list every time I add a task to it.
I can loop through the first list and find the point where this task should be inserted, and then insert it there. But, if its place is early in the list, time is spent shifting all of the other items one place. And if its place is late in the list, I would have to loop through a lot of elements to reach the correct place.
So, I'm not sold on using either of those options. The real question here is: What's the best way to keep a list sorted while adding things to it? Or is there a much better way of doing all of this?
Here is an example of using bisect, compared with re-sorting the partially sorted list after each append. The bisect solution clearly wins:
import bisect
import random
import timeit
def bisect_solution(size=10000):
    lst = []
    for n in xrange(size):
        value = random.randint(1, 1000000)
        bisect.insort_left(lst, value)
    return lst
# Cut out of the bisect module to be used in bisect_solution2()
def insort_left(a, x, lo=0, hi=None):
    """Insert item x in list a, and keep it sorted assuming a is sorted.
    If x is already in a, insert it to the left of the leftmost x.
    Optional args lo (default 0) and hi (default len(a)) bound the
    slice of a to be searched.
    """
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if a[mid] < x: lo = mid+1
        else: hi = mid
    a.insert(lo, x)
def bisect_solution2(size=10000):
    lst = []
    for n in xrange(size):
        value = random.randint(1, 1000000)
        insort_left(lst, value)
    return lst
def sort_solution(size=10000):
    lst = []
    for n in xrange(size):
        value = random.randint(1, 1000000)
        lst.append(value)
        lst.sort()
    return lst
t = timeit.timeit('bisect_solution()', 'from __main__ import bisect_solution', number = 10)
print "bisect_solution: ", t
t = timeit.timeit('bisect_solution2()', 'from __main__ import bisect_solution2', number = 10)
print "bisect_solution2: ", t
t = timeit.timeit('sort_solution()', 'from __main__ import sort_solution', number = 10)
print "sort_solution: ", t
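bisect_solution2() is almost the same as bisect_solution(), only with the code copied out of the module. Someone else should explain why it takes more time :) (A likely reason is that CPython's bisect module uses a C implementation when one is available, while the copied-out version runs as pure Python.)
bisect_solution2() is here so that it can be modified to use a cmp() function that compares the tuples.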
It shows the following results on my computer:
bisect_solution: 0.637892403587
bisect_solution2: 0.988893038133
sort_solution: 15.3521410901
Here is a bisect solution adapted for tuples where the date is a string:
import random
import timeit
def random_date_tuple():
    s1 = '{0}-{1:02}-{2:02}'.format(random.randint(2000, 2050),
                                    random.randint(1, 12),
                                    random.randint(1, 31))
    e2 = random.randint(1,50)
    return (s1, e2)
def my_cmp(a, b):
    result = cmp(a[0], b[0]) # comparing the date part of the tuple
    if result == 0:
        return cmp(a[1], b[1]) # comparing the other part of the tuple
    return result
def my_insort_left(a, x, cmp=my_cmp, lo=0, hi=None):
    """The bisect.insort_left() modified for comparison of tuples."""
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if cmp(a[mid], x) < 0:
            lo = mid+1
        else:
            hi = mid
    a.insert(lo, x)
def bisect_solution3(size=1000):
    lst = []
    for n in xrange(size):
        value = random_date_tuple()
        my_insort_left(lst, value)
    return lst
def sort_solution(size=1000):
    lst = []
    for n in xrange(size):
        value = random_date_tuple()
        lst.append(value)
        lst.sort(cmp=my_cmp)
    return lst
t = timeit.timeit('bisect_solution3()', 'from __main__ import bisect_solution3', number = 10)
print "bisect_solution3: ", t
t = timeit.timeit('sort_solution()', 'from __main__ import sort_solution', number = 10)
print "sort_solution: ", t
print bisect_solution3()[:10]
Notice that the list size is 10 times smaller than in the previous example, as the sort solution was very slow. It prints:
bisect_solution3: 0.223602245968
sort_solution: 3.69388944301
[('2000-02-01', 20), ('2000-02-13', 48), ('2000-03-11', 25), ('2000-03-13', 43),
('2000-03-26', 48), ('2000-05-04', 17), ('2000-06-06', 23), ('2000-06-12', 31),
('2000-06-15', 15), ('2000-07-07', 50)]
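The code above is written for Python 2 (xrange, the print statement, cmp). In Python 3, cmp() and the cmp= argument of list.sort() no longer exist, but tuples already compare element by element, so a plain bisect.insort is enough for (date, id) tuples. A minimal sketch, using datetime.date objects and two made-up tasks added to the fixed-task list from the question:

```python
import bisect
from datetime import date

# (start date, task id) tuples compare element by element, so the list stays
# ordered by date first and by id when two dates are equal.
fixed_tasks = [(date(2012, 4, 30), 1), (date(2012, 5, 1), 5), (date(2012, 5, 4), 2)]

bisect.insort(fixed_tasks, (date(2012, 5, 2), 7))   # hypothetical new task
bisect.insort(fixed_tasks, (date(2012, 4, 29), 9))  # hypothetical new task

print([task_id for _, task_id in fixed_tasks])  # [9, 1, 5, 7, 2]
```

Finding the position is a binary search; the insertion itself still shifts the later elements, so each call remains O(n) in the worst case.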
The real question here is: What's the best way to keep a list sorted while adding things to it?
Insertion sort is the way to go, but you might not like that answer since you already know it. The next thing you can do is this:
Don't sort while adding.
When the items are requested, sort them and cache the result. The next time they are requested, serve the cached copy.
Invalidate the cache whenever a new item is added.
I am not a Python programmer, but I can give you the idea with a PHP class.
class SortedList {
    public $list = array();
    private $cached_list = array();
    private $sorted = false;
    public function add($item) {
        array_push($this->list, $item);
        $this->sorted = false;
    }
    public function get() {
        if ($this->sorted == true) {
            return $this->cached_list;
        }
        // copy the list to the cached list and sort the copy
        $this->cached_list = $this->list;
        sort($this->cached_list);
        // set the flag
        $this->sorted = true;
        return $this->cached_list;
    }
}
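Since the question is about Python, the same cache-and-invalidate idea might look like this there. This is a hedged sketch: the class name `LazySortedList` and its methods are made up for illustration and mirror the PHP class above.

```python
class LazySortedList:
    """Hypothetical Python rendering of the cache-and-invalidate idea."""

    def __init__(self):
        self._items = []
        self._sorted_cache = None  # None means "needs re-sorting"

    def add(self, item):
        self._items.append(item)
        self._sorted_cache = None  # invalidate the cache

    def get(self):
        # Sort only when something was added since the last request.
        if self._sorted_cache is None:
            self._sorted_cache = sorted(self._items)
        return self._sorted_cache


schedule = LazySortedList()
for start in ("2012-05-04", "2012-04-30", "2012-05-01"):
    schedule.add(start)
print(schedule.get())  # ['2012-04-30', '2012-05-01', '2012-05-04']
```

This pays off when many additions happen between reads. If every addition is followed by a read, as in the scheduling loop from the question, it sorts on every step and offers no advantage over sorting after each append.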
I can loop through the first list and find the point where this task
should be inserted, and then insert it there. But, if its place is
early in the list, time is spent shifting all of the other items one
place. And if its place is late in the list, I would have to loop
through a lot of elements to reach the correct place.
Finding the right place to insert something into a sorted list can be done in O(log n) using binary search. Inserting would still be O(n).
There are more complicated data structures like B-Trees that allow inserting and searching in O(log n). Have a look at this and this.
A Heap Queue is your friend. From Wikipedia:
The operations commonly performed with a heap are:
create-heap: create an empty heap
find-max: find the maximum item of a max-heap
delete-max: removing the root node of a max-heap
increase-key: updating a key within a max-heap
insert: adding a new key to the heap
merge: joining two heaps to form a valid new heap containing all the
elements of both.
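There is a built-in Python heap queue implementation, the heapq module. Note that heapq is a min-heap, so it pops the smallest element first rather than the largest. Heaps are optimized for 1) removing the extreme (smallest or largest) element, 2) inserting new elements while maintaining the heap ordering.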
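A minimal sketch of how heapq could hold (start date, task id) tuples; the sample tasks are taken from the list in the question, and using tuples with the date first is an assumption about how the tasks would be keyed:

```python
import heapq
from datetime import date

tasks = []  # heap of (start date, task id) tuples
heapq.heappush(tasks, (date(2012, 5, 1), 5))
heapq.heappush(tasks, (date(2012, 4, 30), 1))
heapq.heappush(tasks, (date(2012, 5, 4), 2))

print(tasks[0])  # the earliest task, without removing it

# Popping repeatedly yields the tasks in ascending date order.
order = [heapq.heappop(tasks)[1] for _ in range(3)]
print(order)  # [1, 5, 2]
```

Both heappush and heappop run in O(log n). The trade-off is that the underlying list is only partially ordered, so a heap suits "give me the next task" access but not scanning a sorted range of tasks for overlaps.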
