I have a data file which has latitude and longitude information which I have stored as a list of tuples of the form
[(lat1, lon1), (lat1, lon1), (lat2, lon2), (lat3, lon3), (lat3, lon3) ...]
As shown above the consecutive locations (lat, lon) may be the same if the location in the data file has not changed. Hence, the order is very important here. What I am interested in is a fairly efficient way to check when the coordinates change, lat1, lon1 -> lat2, lon2 etc. and then get the distance between these two coordinates.
I already have a function to get the distance of the form getDistance(lat1, lon1, lat2, lon2) which returns the calculated distance between these locations. I want to store these distances in a list from which I can do some plots later on.
You could combine a function that filters out duplicates with one that iterates over pairs:
First, let's take care of eliminating duplicate subsequent entries in the list. Since we wish to preserve order, as well as allow duplicates that are not next to each other, we cannot use a simple set. So if we have a list of coordinates such as [(0, 0), (4, 4), (4, 4), (1, 1), (0, 0)], the correct output would be [(0, 0), (4, 4), (1, 1), (0, 0)]. A simple function that accomplishes this is:
def filter_duplicates(items):
    """A generator that ignores subsequent entries that are duplicates

    >>> items = [0, 1, 1, 2, 3, 3, 3, 4, 1]
    >>> list(filter_duplicates(items))
    [0, 1, 2, 3, 4, 1]
    """
    prev = None
    for item in items:
        if item != prev:
            yield item
            prev = item
The yield statement is like a return that pauses the function instead of ending it. Each time it executes, it passes a value back to the caller, and the function resumes from that point when the next value is requested. See What does the "yield" keyword do in Python? for a better explanation.
This simply iterates through each item and compares it to the previous item. If the item is different it yields it back to the calling function and stores it as the current previous item. Another way to write this function would have been:
def filter_duplicates_2(items):
    result = []
    prev = None
    for item in items:
        if item != prev:
            result.append(item)
            prev = item
    return result
Though they accomplish the same thing, this version ends up requiring more memory and is less efficient, because it has to build a new list to store everything.
Now that we have a way to ensure that every item is different from its neighbors, we need to calculate the distance between subsequent pairs. A simple way to do this is:
def pairs(iterable):
    """A generator over pairs of adjacent items in iterable

    >>> list(pairs([0, 8, 2, 1, 3]))
    [(0, 8), (8, 2), (2, 1), (1, 3)]
    """
    iterator = iter(iterable)
    try:
        prev = next(iterator)
    except StopIteration:
        return  # empty input: there is nothing to pair
    for j in iterator:
        yield prev, j
        prev = j
This function is similar to the filter_duplicates function. It simply keeps track of the previous item that it observed, and for each item that it processes it yields that item and the previous item. The only trick it uses is that it assigns prev to the very first item in the list using the next() function call.
If we combine the two functions we end up with:
for (x1, y1), (x2, y2) in pairs(filter_duplicates(coords)):
    distance = getDistance(x1, y1, x2, y2)
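As a rough end-to-end sketch, the same idea can be folded into a single pass that collects the distances in a list for plotting later. The get_distance_km function below is only a hypothetical stand-in for the getDistance function mentioned in the question (a haversine formula on an assumed spherical Earth), and changed_distances is a name made up for this example.

```python
import math

def get_distance_km(lat1, lon1, lat2, lon2):
    """Hypothetical stand-in for getDistance: haversine distance in km."""
    radius = 6371.0  # assumed mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))

def changed_distances(coords, distance=get_distance_km):
    """Return one distance for every change of position in coords, in order."""
    distances = []
    prev = None
    for point in coords:
        # only measure when the position differs from the previous one
        if prev is not None and point != prev:
            distances.append(distance(prev[0], prev[1], point[0], point[1]))
        prev = point
    return distances

coords = [(0.0, 0.0), (0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 1.0)]
distances = changed_distances(coords)
print(len(distances))  # 2 changes of position
```

Passing the distance function as a parameter keeps the sketch independent of how the real getDistance is implemented.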
Here's a way to do it using just functions from itertools:
from itertools import groupby, tee

l = [...]
ks = (k for k, g in groupby(l))
t1, t2 = tee(ks)
next(t2, None)  # advance so we get adjacent pairs
for k1, k2 in zip(t1, t2):  # on Python 2, itertools.izip avoids building a list
    distance = getDistance(k1[0], k1[1], k2[0], k2[1])
This groups adjacent equal elements, then uses a pair of tee'd iterators to pull out adjacent pairs from the group list.
Using just groupby:
import itertools

l = [...]
gs = itertools.groupby(l)
last, _ = next(gs)
for k, g in gs:
    # call getDistance on (last, k)
    last = k
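On Python 3.10 or later, itertools.pairwise does the "adjacent pairs" step directly. The following is a small sketch of that variant; the sample coords list is made up for illustration, and the fallback definition is only there so the snippet also runs on older versions.

```python
from itertools import groupby, tee

try:
    from itertools import pairwise  # available from Python 3.10
except ImportError:
    def pairwise(iterable):
        """Fallback for older Pythons: adjacent pairs via two tee'd iterators."""
        first, second = tee(iterable)
        next(second, None)
        return zip(first, second)

coords = [(0, 0), (4, 4), (4, 4), (1, 1), (0, 0)]
keys = (k for k, _ in groupby(coords))   # one key per run of equal neighbours
changes = list(pairwise(keys))           # (previous, current) for every change
print(changes)
# [((0, 0), (4, 4)), ((4, 4), (1, 1)), ((1, 1), (0, 0))]
```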
I am trying to get a list of lists that represent all possible ordered pairs from an existing list of lists.
import itertools
list_of_lists=[[0, 1, 2, 3, 4], [5], [6, 7],[8, 9],[10, 11],[12, 13],[14, 15],[16, 17],[18, 19],[20, 21],[22, 23],[24, 25],[26, 27],[28, 29],[30, 31],[32, 33],[34, 35],[36, 37],[38],[39]]
Ideally, we would just use itertools.product in order to get that list of ordered pairs.
scenarios_list=list(itertools.product(*list_of_lists))
However, if I were to do this for a larger list of lists I would get a memory error and so this solution is not scalable for larger lists of lists where there could be numerous different sets of ordered pairs.
So, is there a way to iterate through these ordered pairs as they are produced and, before appending each one to another list, test whether it satisfies certain criteria (for example, whether it contains a certain number of even numbers, or whether its sum is not equal to the maximum)? If the criteria are not satisfied, the ordered pair would not be appended, and so would not needlessly use memory when there are only certain ordered pairs that we care about.
Starting with a recursive base implementation of product:
def product(*lsts):
    if not lsts:
        yield ()
        return
    first_lst, *rest = lsts
    for element in first_lst:
        for rec_p in product(*rest):
            p = (element,) + rec_p
            yield p

[*product([1, 2], [3, 4, 5])]
# [(1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5)]
Now, you could augment that with a condition by which you filter any p not meeting it:
def product(*lsts, condition=None):
    if condition is None:
        condition = lambda tpl: True
    if not lsts:
        yield ()
        return
    first_lst, *rest = lsts
    for element in first_lst:
        for rec_p in product(*rest, condition=condition):
            p = (element,) + rec_p
            if condition(p):  # stop overproduction right where it happens
                yield p
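Now you can, for instance, restrict to only even elements. Note that the condition is applied to every partial tuple built from the trailing lists, not just to the finished one, so it must be a predicate that holds for every suffix of an acceptable result: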
[*product([1, 2], [3, 4, 5], condition=lambda tpl: not any(x%2 for x in tpl))]
# [(2, 4)]
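If the criterion only makes sense for complete tuples (an exact count of even numbers, say), a simpler sketch is to wrap itertools.product in a generator expression. Nothing is stored until a tuple passes the test, although every combination is still generated and checked, so this saves memory rather than time. The filtered_product helper, the shortened lists, and the two_evens criterion are made up for this illustration.

```python
import itertools

def filtered_product(lists, condition):
    """Lazily yield only the complete combinations for which condition is true."""
    return (combo for combo in itertools.product(*lists) if condition(combo))

lists = [[0, 1, 2], [5], [6, 7]]

def two_evens(combo):
    # hypothetical criterion: exactly two even numbers in the combination
    return sum(x % 2 == 0 for x in combo) == 2

kept = list(filtered_product(lists, two_evens))
print(kept)
# [(0, 5, 6), (2, 5, 6)]
```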
init_tuple = [(0, 1), (1, 2), (2, 3)]
result = sum(n for _, n in init_tuple)
print(result)
The output for this code is 6. Could someone explain how it worked?
Your code extracts each tuple and sums all values in the second position (i.e. [1]).
If you rewrite it in loops, it may be easier to understand:
init_tuple = [(0, 1), (1, 2), (2, 3)]
result = 0
for (val1, val2) in init_tuple:
    result = result + val2
print(result)
The expression (n for _, n in init_tuple) is a generator expression. You can iterate on such an expression to get all the values it generates. In that case it reads as: generate the second component of each tuple of init_tuple.
(Note on _: The _ here stands for the first component of the tuple. It is common in Python to use this name when you don't care about the variable it refers to (i.e., if you don't plan to use it), as is the case here. Another way to write your generator would then be (tup[1] for tup in init_tuple).)
You can iterate over a generator expression using a for loop. For example:
>>> for x in (n for _, n in init_tuple):
...     print(x)
...
1
2
3
And of course, since you can iterate on a generator expression, you can sum it as you have done in your code.
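One detail worth seeing in action is that a generator expression produces its values lazily and only once. In this small sketch (the variable names gen, first and rest are only for illustration), taking one value with next() leaves just the remaining values for sum():

```python
init_tuple = [(0, 1), (1, 2), (2, 3)]

gen = (n for _, n in init_tuple)
first = next(gen)   # consumes the first generated value
rest = sum(gen)     # sums only what is left: 2 + 3
print(first, rest)  # 1 5

# a fresh generator expression yields the full total again
print(sum(n for _, n in init_tuple))  # 6
```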
To get a better understanding, first look at this:
init_tuple = [(0, 1), (1, 2), (2, 3)]
total = 0
for x, y in init_tuple:
    total = total + y
print(total)
The code above calculates the sum of the second elements of the tuples. It is equivalent to your code, as both do the same job. In the line
for x, y in init_tuple:
x holds the first value of each tuple and y holds the second. In the first iteration:
x = 0, y = 1,
then in the second iteration:
x = 1, y = 2, and so on.
In your case you don't need the first element of the tuple, so you use _ instead of a named variable.
list_of_tuple = [(0,2), (0,6), (4,6), (6,7), (8,9)]
Since (0,2) and (4,6) are both within the index range of (0,6), I want to remove them. The resulting list would be:
list_of_tuple = [(0,6), (6,7), (8,9)]
It seems I need to sort this list of tuples somehow to make the removal easier. But how do I sort a list of tuples?
Given two tuples of array indexes, [m,n] and [a,b], if:
m >= a and n <= b
then [m,n] is included in [a,b], so remove [m,n] from the list.
To remove all tuples from list_of_tuple whose range lies within the specified tuple:
list_of_tuple = [(0,2), (0,6), (4,6), (6,7), (8,9)]
def rm(lst,tup):
    return [tup]+[t for t in lst if t[0] < tup[0] or t[1] > tup[1]]
print(rm(list_of_tuple,(0,6)))
Output:
[(0, 6), (6, 7), (8, 9)]
Here's a dead-simple solution, but it's O(n^2):
intervals = [(0, 2), (0, 6), (4, 6), (6, 7), (8, 9)]  # list_of_tuple
result = [
    t for t in intervals
    if not any(t != u and t[0] >= u[0] and t[1] <= u[1] for u in intervals)
]
It filters out intervals that are not equal to, but contained in, any other intervals.
Seems like an opportunity to abuse both reduce() and Python's logical operators! Solution assumes list is sorted as in the OP's example, primarily on the second element of each tuple, and secondarily on the first:
from functools import reduce
list_of_sorted_tuples = [(0, 2), (0, 6), (4, 6), (6, 7), (8, 9)]
def contains(a, b):
    return a[0] >= b[0] and a[1] <= b[1] and [b] or b[0] >= a[0] and b[1] <= a[1] and [a] or [a, b]

reduced_list = reduce(lambda x, y: x[:-1] + contains(x[-1], y) if x else [y], list_of_sorted_tuples, [])
print(reduced_list)
OUTPUT
> python3 test.py
[(0, 6), (6, 7), (8, 9)]
>
You could try something like this to check if both ends of the (half-open) interval are contained within another interval:
list_of_tuple = [(0,2), (0,6), (4,6), (6,7), (8,9)]
reduced_list = []
for t in list_of_tuple:
    add = True
    for o in list_of_tuple:
        if t is not o:
            r = range(*o)
            if t[0] in r and (t[1] - 1) in r:
                add = False
    if add:
        reduced_list.append(t)
print(reduced_list) # [(0, 6), (6, 7), (8, 9)]
Note: This assumes that your tuples are half-open intervals, i.e. [0, 6) where 0 is inclusive but 6 is exclusive, similar to how range would treat the start and stop parameters. A couple of small changes would have to be made for the case of fully closed intervals:
range(*o) -> range(o[0], o[1] + 1)
and
if t[0] in r and (t[1] - 1) in r: -> if t[0] in r and t[1] in r:
Here is the first step towards a solution that can be done in O(n log(n)):
def non_cont(lot):
    s = sorted(lot, key = lambda t: (t[0], -t[1]))
    i = 1
    while i < len(s):
        if s[i][0] >= s[i - 1][0] and s[i][1] <= s[i - 1][1]:
            del s[i]
        else:
            i += 1
    return s
The idea is that after sorting with the special key function, each element that is contained in some other element is located directly after an element that contains it (once any contained elements between them have been removed). We then sweep the list, removing elements that are contained in the element that precedes them. Note that the sweep-and-delete loop is itself O(n^2), because each del shifts the rest of the list. The solution above is for clarity more than anything else. We can move on to the next implementation:
def non_cont_on(lot):
    s = sorted(lot, key = lambda t: (t[0], -t[1]))
    result = s[:1]
    for t in s[1:]:
        if not (t[0] >= result[-1][0] and t[1] <= result[-1][1]):
            result.append(t)
    return result
There is no quadratic sweep and delete loop here, only a nice, linear process of constructing the result. Space complexity is O(n). It is possible to perform this algorithm without extra, non-constant, space, but I will leave this out.
A side effect of both algorithms is that the intervals are sorted.
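As a sanity check of the sort-then-sweep idea, here is a small sketch that compares it with the quadratic filter on random input. The names sweep and brute_force are made up for this example. Because the sort guarantees that each start is at least the previous start, the sketch only compares end points, which is a simplification assumed here rather than taken from the code above. The comparison uses distinct intervals, since the sweep also drops exact duplicates.

```python
import random

def brute_force(intervals):
    """Quadratic reference: drop intervals contained in a different interval."""
    return [t for t in intervals
            if not any(t != u and t[0] >= u[0] and t[1] <= u[1] for u in intervals)]

def sweep(intervals):
    """Sort by (start, -end), then keep an interval only if it ends later
    than the last one kept; otherwise it is contained in that interval."""
    result = []
    for t in sorted(intervals, key=lambda t: (t[0], -t[1])):
        if not result or t[1] > result[-1][1]:
            result.append(t)
    return result

print(sweep([(0, 2), (0, 6), (4, 6), (6, 7), (8, 9)]))
# [(0, 6), (6, 7), (8, 9)]

rng = random.Random(0)
for _ in range(200):
    starts = [rng.randint(0, 20) for _ in range(rng.randint(0, 12))]
    sample = list({(a, a + rng.randint(1, 10)) for a in starts})  # distinct intervals
    assert sorted(brute_force(sample)) == sweep(sample)
```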
If you want to preserve the information about the inclusion-structure (by which enclosing interval an interval of the original set is consumed) you can build a "one-level tree":
def contained(tpl1, tpl2):
    return tpl1[0] >= tpl2[0] and tpl1[1] <= tpl2[1]

def interval_hierarchy(lst):
    if not lst:
        return
    root = lst.pop()
    children_dict = {root: []}
    while lst:
        t = lst.pop()
        curr_children = list(children_dict.keys())
        for k in curr_children:
            if contained(k, t):
                children_dict[t] = (children_dict[t] if t in children_dict else []) +\
                                   [k, *children_dict[k]]
                children_dict.pop(k)
            elif contained(t, k):
                children_dict[k].append(t)
                if t in children_dict:
                    children_dict[k] += children_dict[t]
                    children_dict.pop(t)
            else:
                if not t in children_dict:
                    children_dict[t] = []
    # return whatever information you might want to use
    return children_dict, list(children_dict.keys())
It appears you are trying to merge intervals which are overlapping. For example, (9,11), (10,12) are merged in the second example below to produce (9,12).
In that case, a simple sort using sorted will automatically handle tuples.
Approach: Store the next interval to be added. Keep extending the end of the interval until you encounter a value whose "start" comes after (>=) the "end" of the next value to add. At that point, that stored next interval can be appended to the results. Append at the end to account for processing all values.
def merge_intervals(val_input):
    if not val_input:
        return []
    vals_sorted = sorted(val_input)  # sorts by tuple values "natural ordering"
    result = []
    x0, x1 = vals_sorted[0]  # store next interval to be added as (x0, x1)
    for start, end in vals_sorted[1:]:
        if start >= x1:  # reached next separate interval
            result.append((x0, x1))
            x0, x1 = (start, end)
        elif end > x1:
            x1 = end  # extend length of next interval to be added
    result.append((x0, x1))
    return result

print(merge_intervals([(0,2), (0,6), (4,6), (6,7), (8,9)]))
print(merge_intervals([(1,2), (9,11), (10,12), (1,7)]))
Output:
[(0, 6), (6, 7), (8, 9)]
[(1, 7), (9, 12)]
In an earlier question:
Generating maximum number of 3-tuples from a list of 2-tuples
I got an answer from #AChampion that seems to work if the number of 2-tuples is divisible by 3. However, the solution fails if we, for example, have 10 2-tuples. After fumbling with it for a while I'm under the impression that it is impossible to find a perfect solution for say:
(1,2),(1,3),(1,4),(2,3),(2,4),(3,4)
So I'm interested in finding one solution that minimizes the number of remainder tuples. In the example above the result could be:
(1,2,3) # derived from (1,2), (1,3), (2,3)
(1,4),(2,4),(3,4) # remainder tuples
The rule for generating a 3-tuple from three 2-tuples is:
(a,b), (b,c), (c,a) -> (a, b, c)
i.e. the 2-tuples form a cycle of length 3. The order of the elements in a 3-tuple is not important, i.e.:
(a,b,c) == (c,a,b)
I'm actually interested in the case where we have a number n:
a = []
for x in range(1, n+1):
    for y in range(1, n+1):
        if x != y:
            a.append((x, y))
# a = [ (1,2),...,(1,n), (2,1),(2,3),...,(2,n),...(n,1),...,(n,n-1) ]
From a, minimize the number of 2-tuples that are left when producing 3-tuples. Each 2-tuple can only be used once.
I wrapped my brain around this for several hours but I can't seem to come up with an elegant solution (well, neither have I found an ugly one:-) for the general case. Any thoughts?
For this you need to create the combinations of numbers that will be used as replacements. Then loop over your data looking for three items that match any of those combinations, and replace them.
I have done this in several steps.
from itertools import combinations

# create replacements elements
number_combinations_raw = list(combinations(range(1, 5), 3))

# create proper number combinations
number_combinations = []
for item in number_combinations_raw:
    if (item[0] + 1 == item[1]) and (item[1] + 1 == item[2]):
        number_combinations.append(item)

# create test data
data = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4)]

# reduce data
reduce_data = []
for number_set in number_combinations:
    count = 0
    merged_data = []
    for item in data:
        if (number_set[0] in item and number_set[1] in item) or (number_set[1] in item and number_set[2] in item) \
                or (number_set[0] in item and number_set[2] in item):
            merged_data.append(item)
            count += 1
    if count == 3:
        reduce_data.append((number_set, merged_data))

# delete merged elements from data list and add replacement
for item in data:
    for reduce_item in reduce_data:
        for element in reduce_item[1]:
            if element in data:
                data.remove(element)
        data = [reduce_item[0]] + data

# remove duplicated replaced elements
final_list = list(dict.fromkeys(data))
Output:
[(1, 2, 3), (1, 4), (2, 4)]
I have got a list of >10.000 int items. The values of the items can be very high, up to 10^27. Now I want to create all pairs of the items and calculate their sum. Then I want to look for different pairs with the same sum.
For example:
l[0] = 4
l[1] = 3
l[2] = 6
l[3] = 1
...
pairs[10] = [(0,2)] # 10 is the sum of the values of l[0] and l[2]
pairs[7] = [(0,1), (2,3)] # 7 is the sum of the values of l[0] and l[1] or l[2] and l[3]
pairs[5] = [(0,3)]
pairs[9] = [(1,2)]
...
The contents of pairs[7] is what I am looking for. It gives me two pairs with the same value sum.
I have implemented it as follows - and I wonder if it can be done faster. Currently, for 10.000 items it takes >6 hours on a fast machine. (As I said, the values of l and so the keys of pairs are ints up to 10^27.)
l = [4,3,6,1]
pairs = {}
for i in range( len( l ) ):
    for j in range(i+1, len( l ) ):
        s = l[i] + l[j]
        if not s in pairs:
            pairs[s] = []
        pairs[s].append((i,j))
# pairs = {9: [(1, 2)], 10: [(0, 2)], 4: [(1, 3)], 5: [(0, 3)], 7: [(0, 1), (2, 3)]}
Edit: I want to add some background, as asked by Simon Stelling.
The goal is to find Formal Analogies like
lays : laid :: says : said
within a list of words like
[ lays, lay, laid, says, said, foo, bar ... ]
I already have a function analogy(a,b,c,d) giving True if a : b :: c : d. However, I would need to check all possible quadruples created from the list, which would be a complexity of around O((n^4)/2).
As a pre-filter, I want to use the char-count property. It says that every char has the same count in (a,d) and in (b,c). For instance, in "layssaid" we have got 2 a's, and so we do in "laidsays"
So the idea until now was
for every word to create a "char count vector" and represent it as an integer (the items in the list l)
create all pairings in pairs and see if there are "pair clusters", i.e. more than one pair for a particular char count vector sum.
And it works, it's just slow. The complexity is down to around O((n^2)/2) but this is still a lot, and especially the dictionary lookup and insert is done that often.
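To make the char-count idea concrete, here is a minimal sketch of one possible encoding. The question does not say how the count vector is turned into an integer, so the base-64 positional scheme, the BASE constant, and the char_count_int name are all assumptions for illustration. The key property is that adding two encodings adds the counts letter by letter, as long as no letter's combined count reaches BASE.

```python
BASE = 64  # assumed digit size: no letter may occur BASE or more times in a pair of words

def char_count_int(word):
    """Encode the letter counts of a lowercase a-z word as a single integer.

    Each letter gets its own base-BASE digit, so the integer is a packed
    count vector and addition of integers adds the vectors.
    """
    return sum(BASE ** (ord(c) - ord("a")) for c in word)

a, b, c, d = "lays", "laid", "says", "said"
outer = char_count_int(a) + char_count_int(d)  # counts in "lays" + "said"
inner = char_count_int(b) + char_count_int(c)  # counts in "laid" + "says"
print(outer == inner)  # True: the pair sums collide, so this quadruple is a candidate
```

Equal sums only mean the quadruple passes the pre-filter; the analogy(a,b,c,d) check is still needed afterwards.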
There are the trivial optimizations, like caching constant values in a local variable and using xrange instead of range (this applies to Python 2; in Python 3, range is already lazy and xrange no longer exists):
pairs = {}
len_l = len(l)
for i in xrange(len_l):
    for j in xrange(i+1, len_l):
        s = l[i] + l[j]
        res = pairs.setdefault(s, [])
        res.append((i,j))
However, it is probably far wiser not to pre-calculate the list and instead to optimize the method at a conceptual level. What is the intrinsic goal you want to achieve? Do you really just want to calculate what you do? Or are you going to use that result for something else? What is that something else?
Just a hint: have a look at itertools.combinations.
This is not exactly what you are looking for (because it stores pair of values, not of indexes), but it can be a starting code:
from itertools import combinations
pairs = {}
for (a, b) in combinations(l, 2):
    pairs.setdefault(a + b, []).append((a, b))
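To store pairs of indexes rather than pairs of values, one option is to run combinations over the index range instead. This is a sketch using the sample list from the question; the clusters dictionary is an added name that keeps only the sums reached by more than one pair.

```python
from itertools import combinations

l = [4, 3, 6, 1]
pairs = {}
# combinations(range(len(l)), 2) yields every index pair (i, j) with i < j once
for i, j in combinations(range(len(l)), 2):
    pairs.setdefault(l[i] + l[j], []).append((i, j))

# keep only the sums that more than one pair produces
clusters = {s: p for s, p in pairs.items() if len(p) > 1}
print(clusters)
# {7: [(0, 1), (2, 3)]}
```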
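The above comment from SimonStelling is correct; generating all possible pairs is just fundamentally slow, and there's nothing you can do about it aside from altering your algorithm. You can use product from itertools, and you can get some minor improvements from not creating extra variables or doing unnecessary list indexing, but under the hood these are still all O(n^2). Note that product(l, repeat=2) also pairs each item with itself and yields both orderings of every pair, so it does about twice the work of the nested loops in the question and stores values rather than indexes; combinations(l, 2) yields each unordered pair once. Here's how I would do it: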
from itertools import product
l = [4,3,6,1]
pairs = {}
for (m,n) in product(l,repeat=2):
    pairs.setdefault(m+n, []).append((m,n))
Finally, I have come up with my own solution, which takes just half of the calculation time on average.
The basic idea: Instead of reading and writing into the growing dictionary n^2 times, I first collect all the sums in a list. Then I sort the list. Within the sorted list, I then look for same neighbouring items.
This is the code:
from operator import itemgetter
def getPairClusters( l ):

    # first, we just store all possible pairs sequentially
    # clustering will happen later
    pairs = []

    for i in xrange( len( l) ):
        for j in xrange(i+1, len( l ) ):
            pair = l[i] + l[j]
            pairs.append( ( pair, i, j ) )
    pairs.sort(key=itemgetter(0))

    # pairs = [ (4, 1, 3), (5, 0, 3), (7, 0, 1), (7, 2, 3), (9, 1, 2), (10, 0, 2)]
    # a list item of pairs now contains a tuple (like (4, 1, 3)) with
    # * the sum of two l items: 4
    # * the index of the two l items: 1, 3

    # now clustering starts
    # we want to find neighbouring items as
    # (7, 0, 1), (7, 2, 3)
    # (since 7=7)

    pairClusters = []

    # flag if we are within a cluster
    # while iterating over pairs list
    withinCluster = False

    # iterate over pair list
    for i in xrange(len(pairs)-1):
        if not withinCluster:
            if pairs[i][0] == pairs[i+1][0]:
                # if not within a cluster
                # and found 2 neighbouring same numbers:
                # init new cluster
                pairCluster = [ ( pairs[i][1], pairs[i][2] ) ]
                withinCluster = True
        else:
            # if still within cluster
            if pairs[i][0] == pairs[i+1][0]:
                pairCluster.append( ( pairs[i][1], pairs[i][2] ) )
            # else cluster has ended
            # (next neighbouring item has different number)
            else:
                pairCluster.append( ( pairs[i][1], pairs[i][2] ) )
                pairClusters.append(pairCluster)
                withinCluster = False

    # a cluster that runs to the very end of the list is still open here:
    # add its last pair and store it, otherwise it would be lost
    if withinCluster:
        pairCluster.append( ( pairs[-1][1], pairs[-1][2] ) )
        pairClusters.append(pairCluster)

    return pairClusters

l = [4,3,6,1]
print getPairClusters(l)
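The same sort-then-scan idea can be written more compactly with itertools.groupby, which collects each run of neighbouring equal sums and so avoids the manual withinCluster flag. This is a Python 3 sketch, not a drop-in replacement for the Python 2 code above, and pair_clusters is a name chosen for this example.

```python
from itertools import combinations, groupby
from operator import itemgetter

def pair_clusters(values):
    """Return lists of index pairs whose values have the same sum."""
    # (sum, i, j) for every index pair i < j, sorted by the sum only
    sums = sorted(
        ((values[i] + values[j], i, j) for i, j in combinations(range(len(values)), 2)),
        key=itemgetter(0),
    )
    clusters = []
    # groupby yields one group per run of neighbouring equal sums
    for _, group in groupby(sums, key=itemgetter(0)):
        members = [(i, j) for _, i, j in group]
        if len(members) > 1:
            clusters.append(members)
    return clusters

print(pair_clusters([4, 3, 6, 1]))
# [[(0, 1), (2, 3)]]
```

The running time is still dominated by generating and sorting all O(n^2) sums; only the clustering step gets shorter.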