Python: fast dictionary of big int keys

I have a list of more than 10,000 int items. The values of the items can be very large, up to 10^27. Now I want to create all pairs of items and calculate their sums, and then look for different pairs with the same sum.
For example:
l[0] = 4
l[1] = 3
l[2] = 6
l[3] = 1
...
pairs[10] = [(0,2)] # 10 is the sum of the values of l[0] and l[2]
pairs[7] = [(0,1), (2,3)] # 7 is the sum of the values of l[0] and l[1] or l[2] and l[3]
pairs[5] = [(0,3)]
pairs[9] = [(1,2)]
...
The contents of pairs[7] are what I am looking for: two pairs with the same value sum.
I have implemented it as follows, and I wonder if it can be done faster. Currently, for 10,000 items it takes more than 6 hours on a fast machine. (As I said, the values of l, and hence the keys of pairs, are ints up to 10^27.)
l = [4, 3, 6, 1]
pairs = {}
for i in range(len(l)):
    for j in range(i + 1, len(l)):
        s = l[i] + l[j]
        if s not in pairs:
            pairs[s] = []
        pairs[s].append((i, j))
# pairs = {9: [(1, 2)], 10: [(0, 2)], 4: [(1, 3)], 5: [(0, 3)], 7: [(0, 1), (2, 3)]}
Edit: I want to add some background, as asked by Simon Stelling.
The goal is to find Formal Analogies like
lays : laid :: says : said
within a list of words like
[ lays, lay, laid, says, said, foo, bar ... ]
I already have a function analogy(a,b,c,d) giving True if a : b :: c : d. However, I would need to check all possible quadruples created from the list, which would be a complexity of around O((n^4)/2).
As a pre-filter, I want to use the char-count property. It says that every char has the same count in (a,d) as in (b,c). For instance, "layssaid" contains 2 a's, and so does "laidsays".
So the idea until now was
for every word to create a "char count vector" and represent it as an integer (the items in the list l)
create all pairings in pairs and see if there are "pair clusters", i.e. more than one pair for a particular char count vector sum.
And it works, it's just slow. The complexity is down to around O((n^2)/2), but this is still a lot, and in particular the dictionary lookup and insert are done that often.
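For context, one way such a char-count integer could be built (the exact encoding is not shown here, so this is only an assumption) is to pack the 26 letter counts as digits in a fixed base; the sum of two encodings is then equal exactly when the combined letter counts are equal, as long as no digit overflows:
BASE = 32  # assumption: each letter occurs fewer than 16 times per word, so sums of two words never carry

def char_count_vector(word):
    # one base-32 digit per letter of the alphabet
    value = 0
    for position, letter in enumerate('abcdefghijklmnopqrstuvwxyz'):
        value += word.count(letter) * BASE ** position
    return value

l = [char_count_vector(w) for w in ['lays', 'lay', 'laid', 'says', 'said']]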

There are the trivial optimizations like caching constant values in a local variable and using xrange instead of range:
pairs = {}
len_l = len(l)
for i in xrange(len_l):
    for j in xrange(i + 1, len_l):
        s = l[i] + l[j]
        res = pairs.setdefault(s, [])
        res.append((i, j))
However, it is probably far more wise to not pre-calculate the list and instead optimize the method on a concept level. What is the intrinsic goal you want to achieve? Do you really just want to calculate what you do? Or are you going to use that result for something else? What is that something else?

Just a hint: have a look at itertools.combinations.
This is not exactly what you are looking for (because it stores pairs of values, not of indexes), but it can be a starting point:
from itertools import combinations

pairs = {}
for (a, b) in combinations(l, 2):
    pairs.setdefault(a + b, []).append((a, b))
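If index pairs are needed, as in the question, the same idea can be applied to the indices rather than the values (l being the question's list); a minimal variation:
from itertools import combinations

pairs = {}
for (i, j) in combinations(range(len(l)), 2):
    pairs.setdefault(l[i] + l[j], []).append((i, j))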

The above comment from SimonStelling is correct; generating all possible pairs is just fundamentally slow, and there's nothing you can do about it aside from altering your algorithm. The correct function to use from itertools is product; you can get some minor improvements from not creating extra variables or doing unnecessary list indexing, but under the hood these are still all O(n^2). Here's how I would do it:
from itertools import product

l = [4, 3, 6, 1]
pairs = {}
for (m, n) in product(l, repeat=2):
    pairs.setdefault(m + n, []).append((m, n))

Finally, I came up with my own solution, which on average takes only half of the calculation time.
The basic idea: instead of reading and writing into the growing dictionary n^2 times, I first collect all the sums in a list. Then I sort the list. Within the sorted list, I then look for identical neighbouring items.
This is the code:
from operator import itemgetter

def getPairClusters(l):
    # first, we just store all possible pairs sequentially
    # clustering will happen later
    pairs = []
    for i in xrange(len(l)):
        for j in xrange(i + 1, len(l)):
            pair = l[i] + l[j]
            pairs.append((pair, i, j))
    pairs.sort(key=itemgetter(0))
    # pairs = [(4, 1, 3), (5, 0, 3), (7, 0, 1), (7, 2, 3), (9, 1, 2), (10, 0, 2)]
    # a list item of pairs now contains a tuple (like (4, 1, 3)) with
    # * the sum of two l items: 4
    # * the indexes of the two l items: 1, 3
    # now clustering starts
    # we want to find neighbouring items as
    # (7, 0, 1), (7, 2, 3)
    # (since 7 == 7)
    pairClusters = []
    # flag if we are within a cluster
    # while iterating over pairs list
    withinCluster = False
    # iterate over pair list
    for i in xrange(len(pairs) - 1):
        if not withinCluster:
            if pairs[i][0] == pairs[i + 1][0]:
                # if not within a cluster
                # and found 2 neighbouring same numbers:
                # init new cluster
                pairCluster = [(pairs[i][1], pairs[i][2])]
                withinCluster = True
        else:
            # if still within cluster
            if pairs[i][0] == pairs[i + 1][0]:
                pairCluster.append((pairs[i][1], pairs[i][2]))
            # else cluster has ended
            # (next neighbouring item has different number)
            else:
                pairCluster.append((pairs[i][1], pairs[i][2]))
                pairClusters.append(pairCluster)
                withinCluster = False
    if withinCluster:
        # close a cluster that runs to the very end of the sorted list,
        # which would otherwise be lost
        pairCluster.append((pairs[-1][1], pairs[-1][2]))
        pairClusters.append(pairCluster)
    return pairClusters

l = [4, 3, 6, 1]
print getPairClusters(l)
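For comparison, the same sort-then-cluster idea can be written more compactly with itertools.groupby; this is only a sketch of an equivalent formulation, not a further speed-up:
from itertools import groupby
from operator import itemgetter

def getPairClustersGroupby(l):
    # all (sum, i, j) triples, sorted by sum
    pairs = sorted((l[i] + l[j], i, j)
                   for i in range(len(l))
                   for j in range(i + 1, len(l)))
    clusters = []
    for s, group in groupby(pairs, key=itemgetter(0)):
        indexPairs = [(i, j) for _, i, j in group]
        if len(indexPairs) > 1:  # only keep sums shared by more than one pair
            clusters.append(indexPairs)
    return clusters

# getPairClustersGroupby([4, 3, 6, 1]) == [[(0, 1), (2, 3)]]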

Related

I'm not able to understand this code in tuple

init_tuple = [(0, 1), (1, 2), (2, 3)]
result = sum(n for _, n in init_tuple)
print(result)
The output for this code is 6. Could someone explain how it worked?
Your code extracts each tuple and sums all values in the second position (i.e. [1]).
If you rewrite it in loops, it may be easier to understand:
init_tuple = [(0, 1), (1, 2), (2, 3)]
result = 0
for (val1, val2) in init_tuple:
    result = result + val2
print(result)
The expression (n for _, n in init_tuple) is a generator expression. You can iterate on such an expression to get all the values it generates. In that case it reads as: generate the second component of each tuple of init_tuple.
(Note on _: The _ here stands for the first component of the tuple. It is common in Python to use this name when you don't care about the variable it refers to (i.e., when you don't plan to use it), as is the case here. Another way to write your generator would then be (tup[1] for tup in init_tuple).)
You can iterate over a generator expression using for loop. For example:
>>> for x in (n for _, n in init_tuple):
...     print(x)
1
2
3
And of course, since you can iterate on a generator expression, you can sum it as you have done in your code.
To get a better understanding, first look at this:
init_tuple = [(0, 1), (1, 2), (2, 3)]
sum = 0
for x, y in init_tuple:
    sum = sum + y
print(sum)
Now you can see that what the above code does is calculate the sum of the second elements of the tuples; it is equivalent to your code, as both do the same job.
In the loop
for x, y in init_tuple:
x holds the first value of the tuple and y holds the second. In the first iteration:
x = 0, y = 1,
then in the second iteration:
x = 1, y = 2, and so on.
In your case you don't need the first element of the tuple, so you just use _ instead of a named variable.

Generating 3-tuples from a set of 2-tuples

In an earlier question:
Generating maximum number of 3-tuples from a list of 2-tuples
I got an answer from AChampion that seems to work if the number of 2-tuples is divisible by 3. However, the solution fails if we, for example, have 10 2-tuples. After fumbling with it for a while I'm under the impression that it is impossible to find a perfect solution for, say:
(1,2)(1,3),(1,4),(2,3),(2,4),(3,4)
So I'm interested in finding one solution that minimizes the number of remainder tuples. In the example above the result could be:
(1,2,3) # derived from (1,2), (1,3), (2,3)
(1,4),(2,4),(3,4) # remainder tuples
The rule for generating a 3-tuple from three 2-tuples is:
(a,b), (b,c), (c,a) -> (a, b, c)
i.e. the 2-tuples form a cycle of length 3. The order of the elements in a 3-tuple is not important, i.e.:
(a,b,c) == (c,a,b)
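As an illustration of this rule (only a sketch, assuming, as in the question, that each 2-tuple consists of two distinct elements), three 2-tuples form such a cycle exactly when they cover three distinct elements, each occurring twice:
from collections import Counter

def forms_triple(p, q, r):
    # (a,b), (b,c), (c,a): every element occurs exactly twice across the three pairs
    counts = Counter(p + q + r)
    return len(counts) == 3 and all(c == 2 for c in counts.values())

# forms_triple((1, 2), (1, 3), (2, 3)) -> True
# forms_triple((1, 2), (1, 3), (1, 4)) -> False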
I'm actually interested in the case where we have a number n:
a = []
for x in range(1, n+1):
    for y in range(1, n+1):
        if x != y:
            a.append((x, y))
# a = [ (1,2),...,(1,n), (2,1),(2,3),...,(2,n),...(n,1),...,(n,n-1) ]
From a, minimize the number of 2-tuples that are left over when producing 3-tuples. Each 2-tuple can only be used once.
I have wrapped my brain around this for several hours but I can't seem to come up with an elegant solution (well, I haven't found an ugly one either :-) for the general case. Any thoughts?
For this you need to create the set of 3-tuples that will be used as replacements. Then loop over your data looking for three 2-tuples that match one of those combinations and replace them.
I have done this in several steps.
from itertools import combinations

# create replacement elements
number_combinations_raw = list(combinations(range(1, 5), 3))
# create proper number combinations
number_combinations = []
for item in number_combinations_raw:
    if (item[0] + 1 == item[1]) and (item[1] + 1 == item[2]):
        number_combinations.append(item)
# create test data
data = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4)]
# reduce data
reduce_data = []
for number_set in number_combinations:
    count = 0
    merged_data = []
    for item in data:
        if (number_set[0] in item and number_set[1] in item) or (number_set[1] in item and number_set[2] in item) \
                or (number_set[0] in item and number_set[2] in item):
            merged_data.append(item)
            count += 1
    if count == 3:
        reduce_data.append((number_set, merged_data))
# delete merged elements from data list and add replacement
for item in data:
    for reduce_item in reduce_data:
        for element in reduce_item[1]:
            if element in data:
                data.remove(element)
        data = [reduce_item[0]] + data
# remove duplicated replaced elements
final_list = list(dict.fromkeys(data))
Output:
[(1, 2, 3), (1, 4), (2, 4)]

How can I filter a dictionary with arbitrary length tuples as keys efficiently?

TL;DR
What is the most efficient way to implement a filter function for a dictionary with keys of variable dimensions? The filter should take a tuple of the same dimensions as the dictionary's keys and output all keys in the dictionary which match the filter such that filter[i] is None or filter[i] == key[i] for all dimensions i.
In my current project, I need to handle dictionaries with a lot of data. The general structure of the dictionary is such that it contains tuples with 2 to 4 integers as keys and integers as values. All keys in a dictionary have the same dimensions. To illustrate, the following are examples of dictionaries I need to handle:
{(1, 2): 1, (1, 5): 2}
{(1, 5, 3): 2}
{(5, 2, 5, 2): 8}
These dictionaries contain a lot of entries, with the largest ones at about 20 000 entries. I frequently need to filter these entries, but often only looking at certain indices of the key tuples. Ideally, I want to have a function to which I can supply a filter tuple. The function should then return all keys which match the filter tuple. If the filter tuple contains a None entry, then this will match any value in the dictionary's key tuple at this index.
Example of what the function should do for a dictionary with 2-dimensional keys:
>>> dict = {(1, 2): 1, (1, 5): 2, (2, 5): 1, (3, 9): 5}
>>> my_filter_fn((1, None))
{(1, 2), (1, 5)}
>>> my_filter_fn((None, 5))
{(1, 5), (2, 5)}
>>> my_filter_fn((2, 4))
set()
>>> my_filter_fn((None, None))
{(1, 2), (1, 5), (2, 5), (3, 9)}
As my dictionaries have different dimensions of their tuples, I have tried solving this problem by writing a generator expression which takes the dimensions of the tuple into account:
def my_filter_fn(entries: dict, match: tuple):
    return (x for x in entries.keys() if all(match[i] is None or match[i] == x[i]
                                             for i in range(len(match))))
Unfortunately, this is quite slow compared to writing out the condition completely by hand ((match[0] is None or match[0] == x[0]) and (match[1] is None or match[1] == x[1]), and so on); for 4 dimensions it is about 10 times slower. This is a problem for me as I need to do this filtering quite often.
The following code demonstrates the performance issue. It is supplied only to illustrate the problem and to enable reproduction of the tests. You can skip the code part; the results are below.
import random
import timeit

def access_variable_length():
    for key in entry_keys:
        for k in (x for x in all_entries.keys() if all(key[i] is None or key[i] == x[i]
                                                       for i in range(len(key)))):
            pass

def access_static_length():
    for key in entry_keys:
        for k in (x for x in all_entries.keys() if
                  (key[0] is None or x[0] == key[0])
                  and (key[1] is None or x[1] == key[1])
                  and (key[2] is None or x[2] == key[2])
                  and (key[3] is None or x[3] == key[3])):
            pass

def get_rand_or_none(start, stop):
    number = random.randint(start - 1, stop)
    if number == start - 1:
        number = None
    return number

entry_keys = set()
for h in range(100):
    entry_keys.add((get_rand_or_none(1, 200), get_rand_or_none(1, 10), get_rand_or_none(1, 4), get_rand_or_none(1, 7)))

all_entries = dict()
for l in range(13000):
    all_entries[(random.randint(1, 200), random.randint(1, 10), random.randint(1, 4), random.randint(1, 7))] = 1

variable_time = timeit.timeit("access_variable_length()", "from __main__ import access_variable_length", number=10)
static_time = timeit.timeit("access_static_length()", "from __main__ import access_static_length", number=10)
print("variable length time: {}".format(variable_time))
print("static length time: {}".format(static_time))
Results:
variable length time: 9.625867042849316
static length time: 1.043319165662158
I would like to avoid having to create three different functions my_filter_fn2, my_filter_fn3, and my_filter_fn4 to cover all possible dimensions of my dictionaries and then use static dimensions filtering. I am aware that filtering for variable dimensions will always be slower than filtering for fixed dimensions, but was hoping that it would not be almost 10 times slower. As I am not a Python expert, I was hoping that there is a clever way in which my variable dimensions generator expression could be reformulated to give me better performance.
What is the most efficient way to filter a huge dictionary in the way I described?
Thanks for the opportunity to think about tuples in sets and dictionaries. It's a very useful and powerful corner of Python.
Python is interpreted, so if you've come from a compiled language, one good rule of thumb is to avoid complex nested iterations where you can. If you're writing complicated for loops or comprehensions, it's always worth wondering if there's a better way to do it.
List subscripts (stuff[i]) and range(len(stuff)) are inefficient and long-winded in Python, and rarely necessary. It's more efficient (and more natural) to iterate:
for item in stuff:
    do_something(item)
The following code is fast because it uses some of the strengths of Python: comprehensions, dictionaries, sets and tuple unpacking.
There are iterations, but they're simple and shallow.
There's only one if statement in the whole of the code, and that's executed only 4 times per filter operation. That also helps performance, and makes the code easier to read.
An explanation of the method...
Each key from the original data:
{(1, 4, 5): 1}
is indexed by position and value:
{
    (0, 1): (1, 4, 5),
    (1, 4): (1, 4, 5),
    (2, 5): (1, 4, 5)
}
(Python numbers elements from zero.)
Indexes are collated into one big lookup dictionary composed of sets of tuples:
{
    (0, 1): {(1, 4, 5), (1, 6, 7), (1, 2), (1, 8), (1, 4, 2, 8), ...},
    (0, 2): {(2, 1), (2, 2), (2, 4, 1, 8), ...},
    (1, 4): {(1, 4, 5), (1, 4, 2, 8), (2, 4, 1, 8), ...},
    ...
}
Once this lookup is built (and it is built very efficiently) filtering is just set intersection and dictionary lookup, both of which are lightning-fast. Filtering takes microseconds on even a large dictionary.
The method handles data with tuples of arity 2, 3 or 4 (or any other) but arity_filtered() returns only keys with the same number of members as the filter tuple. So this class gives you the option of filtering all data together, or handling the different sizes of tuple separately, with little to choose between them as regards performance.
Timing results for the large random dataset (11,500 tuples) were 0.30s to build the lookup, 0.007 seconds for 100 lookups.
from collections import defaultdict
import random
import timeit

class TupleFilter:
    def __init__(self, data):
        self.data = data
        self.lookup = self.build_lookup()

    def build_lookup(self):
        lookup = defaultdict(set)
        for data_item in self.data:
            for member_ref, data_key in tuple_index(data_item).items():
                lookup[member_ref].add(data_key)
        return lookup

    def filtered(self, tuple_filter):
        # initially unfiltered
        results = self.all_keys()
        # reduce filtered set
        for position, value in enumerate(tuple_filter):
            if value is not None:
                match_or_empty_set = self.lookup.get((position, value), set())
                results = results.intersection(match_or_empty_set)
        return results

    def arity_filtered(self, tuple_filter):
        tf_length = len(tuple_filter)
        return {match for match in self.filtered(tuple_filter) if tf_length == len(match)}

    def all_keys(self):
        return set(self.data.keys())

def tuple_index(item_key):
    member_refs = enumerate(item_key)
    return {(pos, val): item_key for pos, val in member_refs}

data = {
    (1, 2): 1,
    (1, 5): 2,
    (1, 5, 3): 2,
    (5, 2, 5, 2): 8
}
tests = {
    (1, 5): 2,
    (1, None, 3): 1,
    (1, None): 3,
    (None, 5): 2,
}
tf = TupleFilter(data)
for filter_tuple, expected_length in tests.items():
    result = tf.filtered(filter_tuple)
    print("Filter {0} => {1}".format(filter_tuple, result))
    assert len(result) == expected_length
# same arity filtering
filter_tuple = (1, None)
print('Not arity matched: {0} => {1}'
      .format(filter_tuple, tf.filtered(filter_tuple)))
print('Arity matched: {0} => {1}'
      .format(filter_tuple, tf.arity_filtered(filter_tuple)))
# check unfiltered results return original data set
assert tf.filtered((None, None)) == tf.all_keys()
>>> python filter.py
Filter (1, 5) finds {(1, 5), (1, 5, 3)}
Filter (1, None, 3) finds {(1, 5, 3)}
Filter (1, None) finds {(1, 2), (1, 5), (1, 5, 3)}
Filter (None, 5) finds {(1, 5), (1, 5, 3)}
Arity filtering: note two search results only: (1, None) => {(1, 2), (1, 5)}
I've made some modifications:
you don't need to use the dict.keys method to iterate through keys; iterating through the dict object itself gives us its keys,
created separate modules, which makes the code easier to read and modify:
preparation.py with helpers for generating test data:
import random

left_ends = [200, 10, 4, 7]

def generate_all_entries(count):
    return {tuple(random.randint(1, num)
                  for num in left_ends): 1
            for _ in range(count)}

def generate_entry_keys(count):
    return [tuple(get_rand_or_none(1, num)
                  for num in left_ends)
            for _ in range(count)]

def get_rand_or_none(start, stop):
    number = random.randint(start - 1, stop)
    if number == start - 1:
        number = None
    return number
functions.py for tested functions,
main.py for benchmarks.
passing arguments to the functions instead of getting them from the global scope, so the static & variable length versions become
def access_static_length(all_entries, entry_keys):
    for key in entry_keys:
        for k in (x
                  for x in all_entries
                  if (key[0] is None or x[0] == key[0])
                  and (key[1] is None or x[1] == key[1])
                  and (key[2] is None or x[2] == key[2])
                  and (key[3] is None or x[3] == key[3])):
            pass

def access_variable_length(all_entries, entry_keys):
    for key in entry_keys:
        for k in (x
                  for x in all_entries
                  if all(key[i] is None or key[i] == x[i]
                         for i in range(len(key)))):
            pass
using min on the results of timeit.repeat instead of timeit.timeit to get the most representative results (more in this answer),
changing entry_keys element count from 10 to 100 (both ends included) with step 10,
changing all_entries element count from 10000 to 15000 (both ends included) with step 500.
But getting back to the point.
Improvements
We can improve filtering by skipping checks for indexes whose key values are None:
def access_variable_length_with_skipping_none(all_entries, entry_keys):
    for key in entry_keys:
        non_none_indexes = {i
                            for i, value in enumerate(key)
                            if value is not None}
        for k in (x
                  for x in all_entries.keys()
                  if all(key[i] == x[i]
                         for i in non_none_indexes)):
            pass
The next suggestion is to use numpy:
import numpy as np

def access_variable_length_numpy(all_entries, entry_keys):
    keys_array = np.array(list(all_entries))
    for entry_key in entry_keys:
        non_none_indexes = [i
                            for i, value in enumerate(entry_key)
                            if value is not None]
        non_none_values = [value
                           for i, value in enumerate(entry_key)
                           if value is not None]
        mask = keys_array[:, non_none_indexes] == non_none_values
        indexes, _ = np.where(mask)
        for k in map(tuple, keys_array[indexes]):
            pass
Benchmarks
Contents of main.py:
import timeit
from itertools import product

number = 5
repeat = 10
for all_entries_count, entry_keys_count in product(range(10000, 15001, 500),
                                                   range(10, 101, 10)):
    print('all entries count: {}'.format(all_entries_count))
    print('entry keys count: {}'.format(entry_keys_count))
    preparation_part = ("from preparation import (generate_all_entries,\n"
                        " generate_entry_keys)\n"
                        "all_entries = generate_all_entries({all_entries_count})\n"
                        "entry_keys = generate_entry_keys({entry_keys_count})\n"
                        .format(all_entries_count=all_entries_count,
                                entry_keys_count=entry_keys_count))
    static_time = min(timeit.repeat(
        "access_static_length(all_entries, entry_keys)",
        preparation_part + "from functions import access_static_length",
        repeat=repeat,
        number=number))
    variable_time = min(timeit.repeat(
        "access_variable_length(all_entries, entry_keys)",
        preparation_part + "from functions import access_variable_length",
        repeat=repeat,
        number=number))
    variable_time_with_skipping_none = min(timeit.repeat(
        "access_variable_length_with_skipping_none(all_entries, entry_keys)",
        preparation_part +
        "from functions import access_variable_length_with_skipping_none",
        repeat=repeat,
        number=number))
    variable_time_numpy = min(timeit.repeat(
        "access_variable_length_numpy(all_entries, entry_keys)",
        preparation_part +
        "from functions import access_variable_length_numpy",
        repeat=repeat,
        number=number))
    print("static length time: {}".format(static_time))
    print("variable length time: {}".format(variable_time))
    print("variable length time with skipping `None` keys: {}"
          .format(variable_time_with_skipping_none))
    print("variable length time with numpy: {}"
          .format(variable_time_numpy))
which on my machine with Python 3.6.1 gives:
all entries count: 10000
entry keys count: 10
static length time: 0.06314293399918824
variable length time: 0.5234129569980723
variable length time with skipping `None` keys: 0.2890012050011137
variable length time with numpy: 0.22945181500108447
all entries count: 10000
entry keys count: 20
static length time: 0.12795891799760284
variable length time: 1.0610534609986644
variable length time with skipping `None` keys: 0.5744297259989253
variable length time with numpy: 0.5105678180007089
all entries count: 10000
entry keys count: 30
static length time: 0.19210158399801003
variable length time: 1.6491422000035527
variable length time with skipping `None` keys: 0.8566724129996146
variable length time with numpy: 0.7363859869983571
all entries count: 10000
entry keys count: 40
static length time: 0.2561357790000329
variable length time: 2.08878050599742
variable length time with skipping `None` keys: 1.1256247100027394
variable length time with numpy: 1.0066140279996034
all entries count: 10000
entry keys count: 50
static length time: 0.32130833200062625
variable length time: 2.6166040710013476
variable length time with skipping `None` keys: 1.4147321179989376
variable length time with numpy: 1.1700750320014777
all entries count: 10000
entry keys count: 60
static length time: 0.38276188999952865
variable length time: 3.153736616997776
variable length time with skipping `None` keys: 1.7147898039984284
variable length time with numpy: 1.4533947029995034
all entries count: 10000
entry keys count: 70
...
all entries count: 15000
entry keys count: 80
static length time: 0.7141444490007416
variable length time: 6.186657476999244
variable length time with skipping `None` keys: 3.376506028998847
variable length time with numpy: 3.1577993860009883
all entries count: 15000
entry keys count: 90
static length time: 0.8115685330012639
variable length time: 7.14327938399947
variable length time with skipping `None` keys: 3.7462387939995097
variable length time with numpy: 3.6140603050007485
all entries count: 15000
entry keys count: 100
static length time: 0.8950150890013902
variable length time: 7.829741768000531
variable length time with skipping `None` keys: 4.1662235900003
variable length time with numpy: 3.914334102999419
Summary
As we can see, the numpy version isn't as good as expected, and it doesn't seem to be numpy's fault.
If we remove converting filtered array records to tuples with map and just leave
for k in keys_array[indexes]:
    ...
then it will be extremely fast (faster than the static length version), so the problem is in the conversion from numpy.ndarray objects to tuple.
Filtering out None entry keys gives us about 50% speed gain, so feel free to add it.
I don't have a beautiful answer; this sort of optimisation often makes code harder to read. But if you just need more speed, here are two things you can do.
Firstly we can straightforwardly eliminate a repeated computation from inside the loop. You say that all the entries in each dictionary have the same length so you can compute that once, rather than repeatedly in the loop. This shaves off about 20% for me:
def access_variable_length():
    try:
        length = len(iter(entry_keys).next())
    except StopIteration:
        # entry_keys is empty
        return
    r = list(range(length))
    for key in entry_keys:
        for k in (x for x in all_entries.keys() if all(key[i] is None or key[i] == x[i]
                                                       for i in r)):
            pass
Not pretty, I agree. But we can make it much faster (and even uglier!) by building the fixed length function using eval. Like this:
def access_variable_length_new():
    try:
        length = len(iter(entry_keys).next())
    except StopIteration:
        # entry_keys is empty
        return
    func_l = ["(key[{0}] is None or x[{0}] == key[{0}])".format(i) for i in range(length)]
    func_s = "lambda x, key: " + " and ".join(func_l)
    func = eval(func_s)
    for key in entry_keys:
        for k in (x for x in all_entries.keys() if func(x, key)):
            pass
For me, this is nearly as fast as the static version.
Let's say you have a dictionary - d
d = {(1,2):3,(1,4):5,(2,4):2,(1,3):4,(2,3):6,(5,1):5,(3,8):5,(3,6):9}
First you can get the dictionary keys:
keys = d.keys()
=>
dict_keys([(1, 2), (3, 8), (1, 3), (2, 3), (3, 6), (5, 1), (2, 4), (1, 4)])
Now let's define a function is_match which decides, for two given tuples, whether they match based on your conditions:
is_match((1,7),(1,None)), is_match((1,5),(None,5)) and is_match((1,4),(1,4)) will return True, while is_match((1,7),(1,8)) and is_match((4,7),(6,12)) will return False.
def if_equal(a, b):
    if a is None or b is None:
        return True
    else:
        if a == b:
            return True
        else:
            return False

is_match = lambda a, b: False not in list(map(if_equal, a, b))
tup = (1, None)
matched_keys = [key for key in keys if is_match(key, tup)]
=>
[(1, 2), (1, 3), (1, 4)]

Remove duplicate tuples from a list if they are exactly the same including order of items

I know questions similar to this have been asked many, many times on Stack Overflow, but I need to remove duplicate tuples from a list only if they are exactly the same, including the order of their elements. In other words, (4,3,5) and (3,4,5) would both be present in the output, while if there were both (3,3,5) and (3,3,5), only one would be in the output.
Specifically, my code is:
import itertools
x = [1,1,1,2,2,2,3,3,3,4,4,5]
y = []
for x in itertools.combinations(x, 3):
    y.append(x)
print(y)
of which the output is quite lengthy. For example, in the output, there should be both (1,2,1) and (1,1,2). But there should only be one (1,2,2).
set will take care of that:
>>> a = [(1,2,2), (2,2,1), (1,2,2), (4,3,5), (3,3,5), (3,3,5), (3,4,5)]
>>> set(a)
set([(1, 2, 2), (2, 2, 1), (3, 4, 5), (3, 3, 5), (4, 3, 5)])
>>> list(set(a))
[(1, 2, 2), (2, 2, 1), (3, 4, 5), (3, 3, 5), (4, 3, 5)]
>>>
set will remove only exact duplicates.
What you need is unique permutations rather than combinations:
y = list(set(itertools.permutations(x,3)))
That is, (1,2,2) and (2,1,2) are considered the same combination, and only one of them will be returned. They are, however, different permutations. Use set() to remove duplicates.
If afterwards you want to sort elements within each tuple and also have the whole list sorted, you can do:
y = [tuple(sorted(q)) for q in y]
y.sort()
No need for an explicit for loop; combinations returns a generator.
x = [1,1,1,2,2,2,3,3,3,4,4,5]
y = list(set(itertools.combinations(x,3)))
This will probably do what you want, but it's vast overkill. It's a low-level prototype for a generator that may be added to itertools some day. It's low level to ease re-implementing it in C. Where N is the length of the iterable input, it requires worst-case space O(N) and does at most N*(N-1)//2 element comparisons, regardless of how many anagrams are generated. Both of those are optimal ;-)
You'd use it like so:
>>> x = [1,1,1,2,2,2,3,3,3,4,4,5]
>>> for t in anagrams(x, 3):
... print(t)
(1, 1, 1)
(1, 1, 2)
(1, 1, 3)
(1, 1, 4)
(1, 1, 5)
(1, 2, 1)
...
There will be no duplicates in the output. Note: this is Python 3 code. It needs a few changes to run under Python 2.
import operator

class ENode:
    def __init__(self, initial_index=None):
        self.indices = [initial_index]
        self.current = 0
        self.prev = self.next = self

    def index(self):
        "Return current index."
        return self.indices[self.current]

    def unlink(self):
        "Remove self from list."
        self.prev.next = self.next
        self.next.prev = self.prev

    def insert_after(self, x):
        "Insert node x after self."
        x.prev = self
        x.next = self.next
        self.next.prev = x
        self.next = x

    def advance(self):
        """Advance the current index.

        If we're already at the end, remove self from list.
        .restore() undoes everything .advance() did."""
        assert self.current < len(self.indices)
        self.current += 1
        if self.current == len(self.indices):
            self.unlink()

    def restore(self):
        "Undo what .advance() did."
        assert self.current <= len(self.indices)
        if self.current == len(self.indices):
            self.prev.insert_after(self)
        self.current -= 1

def build_equivalence_classes(items, equal):
    ehead = ENode()
    for i, elt in enumerate(items):
        e = ehead.next
        while e is not ehead:
            if equal(elt, items[e.indices[0]]):
                # Add (index of) elt to this equivalence class.
                e.indices.append(i)
                break
            e = e.next
        else:
            # elt not equal to anything seen so far: append
            # new equivalence class.
            e = ENode(i)
            ehead.prev.insert_after(e)
    return ehead

def anagrams(iterable, count=None, equal=operator.__eq__):
    def perm(i):
        if i:
            e = ehead.next
            assert e is not ehead
            while e is not ehead:
                result[count - i] = e.index()
                e.advance()
                yield from perm(i-1)
                e.restore()
                e = e.next
        else:
            yield tuple(items[j] for j in result)

    items = tuple(iterable)
    if count is None:
        count = len(items)
    if count > len(items):
        return
    ehead = build_equivalence_classes(items, equal)
    result = [None] * count
    yield from perm(count)
You were really close. Just get permutations, not combinations. Order matters in permutations, and it does not in combinations. Thus (1, 2, 2) is a distinct permutation from (2, 2, 1). However (1, 2, 2) is considered a singular combination of one 1 and two 2s. Therefore (2, 2, 1) is not considered a distinct combination from (1, 2, 2).
You can convert your list y to a set so that you remove duplicates...
import itertools
x = [1,1,1,2,2,2,3,3,3,4,4,5]
y = []
for x in itertools.permutations(x, 3):
    y.append(x)
print(set(y))
And voila, you are done. :)
Using a set should probably work. A set is basically a container that doesn't contain any duplicated elements.
Python also includes a data type for sets. A set is an unordered
collection with no duplicate elements. Basic uses include membership
testing and eliminating duplicate entries. Set objects also support
mathematical operations like union, intersection, difference, and
symmetric difference.
import itertools
x = [1,1,1,2,2,2,3,3,3,4,4,5]
y = set()
for x in itertools.combinations(x, 3):
    y.add(x)
print(y)

Python, efficient way to operate on pair of coordinates

I have a data file which has latitude and longitude information which I have stored as a list of tuples of the form
[(lat1, lon1), (lat1, lon1), (lat2, lon2), (lat3, lon3), (lat3, lon3) ...]
As shown above the consecutive locations (lat, lon) may be the same if the location in the data file has not changed. Hence, the order is very important here. What I am interested in is a fairly efficient way to check when the coordinates change, lat1, lon1 -> lat2, lon2 etc. and then get the distance between these two coordinates.
I already have a function to get the distance of the form getDistance(lat1, lon1, lat2, lon2) which returns the calculated distance between these locations. I want to store these distances in a list from which I can do some plots later on.
You could combine a function that filters out duplicates with one that iterates over pairs:
First let's take care of eliminating duplicate subsequent entries in the list. Since we wish to preserve order, as well as allow duplicates that are not next to each other, we cannot use a simple set. So if we have a list of coordinates such as [(0, 0), (4, 4), (4, 4), (1, 1), (0, 0)], the correct output would be [(0, 0), (4, 4), (1, 1), (0, 0)]. A simple function that accomplishes this is:
def filter_duplicates(items):
    """A generator that ignores subsequent entries that are duplicates

    >>> items = [0, 1, 1, 2, 3, 3, 3, 4, 1]
    >>> list(filter_duplicates(items))
    [0, 1, 2, 3, 4, 1]
    """
    prev = None
    for item in items:
        if item != prev:
            yield item
            prev = item
The yield statement is like a return that doesn't actually return. Each time it is called it passes the value back to the calling function. See What does the "yield" keyword do in Python? for a better explanation.
This simply iterates through each item and compares it to the previous item. If the item is different it yields it back to the calling function and stores it as the current previous item. Another way to write this function would have been:
def filter_duplicates_2(items):
    result = []
    prev = None
    for item in items:
        if item != prev:
            result.append(item)
            prev = item
    return result
Though they accomplish the same thing, this way ends up requiring more memory and is less efficient because it has to create a new list to store everything.
Now that we have a way to ensure that every item is different from its neighbors, we need to calculate the distance between subsequent pairs. A simple way to do this is:
def pairs(iterable):
    """A generator over pairs of items in iterable

    >>> list(pairs([0, 8, 2, 1, 3]))
    [(0, 8), (8, 2), (2, 1), (1, 3)]
    """
    iterator = iter(iterable)
    prev = next(iterator)
    for j in iterator:
        yield prev, j
        prev = j
This function is similar to the filter_duplicates function. It simply keeps track of the previous item that it observed, and for each item that it processes it yields that item and the previous item. The only trick it uses is that it assigns prev to the very first item in the iterable using the next() function call.
If we combine the two functions we end up with:
for (x1, y1), (x2, y2) in pairs(filter_duplicates(coords)):
    distance = getDistance(x1, y1, x2, y2)
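Since the question asks for the distances to be stored in a list for plotting later, the combined loop can simply append each result (coords and getDistance are the names used in the question):
distances = []
for (x1, y1), (x2, y2) in pairs(filter_duplicates(coords)):
    distances.append(getDistance(x1, y1, x2, y2))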
Here's a way to do it using just functions from itertools:
from itertools import *

l = [...]
ks = (k for k, g in groupby(l))
t1, t2 = tee(ks)
t2.next()  # advance so we get adjacent pairs
for k1, k2 in izip(t1, t2):
    # call getDistance on k1, k2
This groups adjacent equal elements, then uses a pair of tee'd iterators to pull out adjacent pairs from the group list.
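In Python 3 (where izip and the .next() method no longer exist), an equivalent sketch of the same idea could look like this, with coords and getDistance being the names used in the question:
from itertools import groupby, tee

ks = (k for k, g in groupby(coords))  # collapse runs of equal coordinates
t1, t2 = tee(ks)
next(t2, None)                        # advance t2 so the two iterators yield adjacent pairs
for (lat1, lon1), (lat2, lon2) in zip(t1, t2):
    distance = getDistance(lat1, lon1, lat2, lon2)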
Using just groupby:
l = [...]
gs = itertools.groupby(l)
last, _ = gs.next()
for k, g in gs:
    # call getDistance on (last, k)
    last = k
