A faster way of comparing two lists of point-tuples? - python

I have two lists (which may or may not be the same length). Each list contains a series of tuples of two points (basically X, Y values).
I am comparing the two lists against each other to find pairs of points with similar values. I tried list comprehension techniques, but the nested tuples inside the lists got confusing and I couldn't get it to work.
Is this the best (fastest) way of doing this? I feel like there might be a more Pythonic way.
Say I have two lists:
pointPairListA = [(2,1), (4,8)]
pointPairListB = [(3,2), (10,2), (4,2)]
And then an empty list for storing the matched pairs, plus a tolerance value so only sufficiently close pairs are kept:
matchedPairs = []
tolerance = 2
And then this loop that unpacks the tuples, compares the difference, and adds them to the matchedPairs list to indicate a match.
for pointPairA in pointPairListA:
    for pointPairB in pointPairListB:
        ## Assign the current X,Y values for each pair
        pointPairA_x, pointPairA_y = pointPairA
        pointPairB_x, pointPairB_y = pointPairB
        ## Get the difference of each set of points
        xDiff = abs(pointPairA_x - pointPairB_x)
        yDiff = abs(pointPairA_y - pointPairB_y)
        if xDiff <= tolerance and yDiff <= tolerance:
            matchedPairs.append((pointPairA, pointPairB))
That would result in matchedPairs looking like this, with tuples of both point tuples inside:
[( (2,1), (3,2) ), ( (2,1), (4,2) )]

Here pointPairA is the single list and pointPairB stands in for one of the 20k lists:
from collections import defaultdict
from itertools import product
pointPairA = [(2,1), (4,8)]
pointPairB = [(3,2), (10,2), (4,2)]
tolerance = 2
dA = defaultdict(list)
tolrange = range(-tolerance, tolerance+1)
for pA, dx, dy in product(pointPairA, tolrange, tolrange):
    dA[pA[0]+dx, pA[1]+dy].append(pA)
# you would have a loop here through the 20k lists
matchedPairs = [(pA, pB) for pB in pointPairB for pA in dA[pB]]
print(matchedPairs)

If these lists are large, I would suggest finding a faster algorithm...
I would start by sorting both lists of pairs by the sum of the (x,y) in the pair. (Because two points can be close only if their sums are close.)
For any point in the first list, that will severely limit the range you need to search in the second list. Keep track of a "sliding window" on the second list, corresponding to the elements whose sums are within 2*tolerance of the sum of the current element of the first list. (Actually, you only need to keep track of the start of the sliding window...)
Assuming tolerance is reasonably small, this should convert your O(n^2) operation into O(n log n).
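A minimal sketch of that idea (my own illustration, not from the original answer; match_pairs_sorted is a made-up name, and it reuses the <= tolerance convention from the question's expected output):

def match_pairs_sorted(listA, listB, tolerance):
    # Sort both lists by coordinate sum: two points within `tolerance`
    # on both axes differ by at most 2*tolerance in their sums.
    a = sorted(listA, key=lambda p: p[0] + p[1])
    b = sorted(listB, key=lambda p: p[0] + p[1])
    matched = []
    start = 0  # left edge of the sliding window into b
    for pa in a:
        s = pa[0] + pa[1]
        # Slide past points whose sum is too small for pa; since a is
        # also sorted by sum, they can never match a later pa either.
        while start < len(b) and b[start][0] + b[start][1] < s - 2 * tolerance:
            start += 1
        j = start
        while j < len(b) and b[j][0] + b[j][1] <= s + 2 * tolerance:
            pb = b[j]
            if abs(pa[0] - pb[0]) <= tolerance and abs(pa[1] - pb[1]) <= tolerance:
                matched.append((pa, pb))
            j += 1
    return matched

print(match_pairs_sorted([(2, 1), (4, 8)], [(3, 2), (10, 2), (4, 2)], 2))
# [((2, 1), (3, 2)), ((2, 1), (4, 2))]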

With list comprehension:
[(pa, pb) for pa in pointPairListA for pb in pointPairListB
 if abs(pa[0]-pb[0]) <= tolerance and abs(pa[1]-pb[1]) <= tolerance]
Slightly faster than your loop:
(for 1 million executions)
>>> (list comprehension).timeit()
2.1963138580322266 s
>>> (your method).timeit()
2.454944133758545 s
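The (list comprehension) and (your method) placeholders above elide the actual benchmark code; a sketch of how such a comparison might be set up with the timeit module (the statement string is just the comprehension from this answer):

import timeit

setup = '''
pointPairListA = [(2, 1), (4, 8)]
pointPairListB = [(3, 2), (10, 2), (4, 2)]
tolerance = 2
'''

stmt = '''
[(pa, pb) for pa in pointPairListA for pb in pointPairListB
 if abs(pa[0]-pb[0]) <= tolerance and abs(pa[1]-pb[1]) <= tolerance]
'''

print(timeit.timeit(stmt, setup=setup, number=1000000))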

Related

Most Efficient Way To Separate Out Data from List within a List

I have a function which outputs 2 arrays, let's call them X_i and Y_i; these are two N x 1 arrays, where N is the number of points. By using multiprocessing's pool.apply_async, I was able to parallelize this function, which gave me my results in a HUGE list. The structure of the results is a list of M values, where each value is a list containing X_i and Y_i. So in summary, I have a huge list containing M smaller lists, each holding the two arrays X_i and Y_i.
Now I want to append all the X_i's into one array called X and all the Y_i's into one called Y. What is the most efficient way to do this? I'm looking for some sort of parallel algorithm. Order does NOT matter!
So far I have just a simple for loop that separates this massive data array:
X = np.zeros((N,1))
Y = np.zeros((N,1))
for i in range(len(results)):
    X = np.append(results[i][0].reshape(N,1), X, axis=1)
    Y = np.append(results[i][1].reshape(N,1), Y, axis=1)
I found this algorithm to be rather slow, so I need to speed it up! Thanks!
You should provide a simple scenario of your problem: break it down and give us a simple input/output example. All these variables and text make it a bit confusing. Maybe this can help:
You can unpack the lists, then grab the ones you need by index, appending one list to your new empty X and the other to Y; at the end, get the arrays out of the lists and merge them into your new N-dimensional array or into a new list.
data = [[[1,2],[3,4]], [[4,5],[6,7]]]  # renamed from `list` to avoid shadowing the builtin
sub_pre = []
flat_list = []
for sublist in data:
    sub_pre.append(sublist)
    for item in sublist:
        flat_list.append(item)
print(data)
print(flat_list)
Thanks to @JonSG for the brilliant insight. This kind of unpacking can be sped up using array manipulation. With most parallel packages, a function that outputs multiple arrays will most likely get its results put into one huge list. Here I have a list called results, which contains M smaller lists of two N x 1 arrays.
To unpack the main list and sort all the X_i and Y_i into their own X and Y arrays respectively, it can be done like this:
# np.shape(results) == (M, 2, N)
X = np.array(results)[:,0,:]
Y = np.array(results)[:,1,:]
This gave me a 100x speed increase!
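A self-contained check of that slicing (M, N and the random data here are made up for the demo):

import numpy as np

M, N = 1000, 50
# Fake `results`: M sublists, each holding two length-N arrays (X_i, Y_i)
results = [[np.random.rand(N), np.random.rand(N)] for _ in range(M)]

stacked = np.array(results)  # shape (M, 2, N)
X = stacked[:, 0, :]         # all X_i rows, shape (M, N)
Y = stacked[:, 1, :]         # all Y_i rows, shape (M, N)
print(stacked.shape, X.shape, Y.shape)  # (1000, 2, 50) (1000, 50) (1000, 50)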

How to save memory in python3?

I have a question about a memory error in Python 3.6.
import itertools
input_list = ['a','b','c','d']
group_to_find = list(itertools.product(input_list,input_list))
a = []
for i in range(len(group_to_find)):
    if group_to_find[i] not in a:
        a.append(group_to_find[i])
group_to_find = list(itertools.product(input_list,input_list))
MemoryError
You are creating a list, in full, from the Cartesian product of your input list, so in addition to input_list you now need len(input_list) ** 2 memory slots for all the results. You then filter that list down again into another list. All in all, for N items you need memory for 2N + (N * N) references. If N is 1000, that's 1 million and 2 thousand references; for N = 1 million, you need a million million plus 2 million references. Etc.
Your code doesn't need to create the group_to_find list, at all, for two reasons:
1. You could just iterate and handle each pair individually:
a = []
for pair in itertools.product(input_list, repeat=2):
    if pair not in a:
        a.append(pair)
This is still going to be slow, because pair not in a has to scan the whole list to find matches. You do this N times, for up to K pairs (where K is the product of the number of unique values in input_list, potentially equal to N), so that's N * K time spent checking for duplicates. You could use a = set() to make that faster. But see point 2.
2. Your end product in a is the exact same list of pairs that itertools.product() would produce anyway, unless your input values are not unique. You could just make them unique first:
a = itertools.product(set(input_list), repeat=2)
Again, don't put this in a list. Iterate over it in a loop and use the pairs it produces one by one.
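A small sketch contrasting the two memory profiles (handle is a hypothetical stand-in for whatever per-pair work you actually do):

import itertools

input_list = ['a', 'b', 'c', 'd']

def handle(pair):
    pass  # hypothetical stand-in for real per-pair work

# Materializes all len(input_list) ** 2 pairs at once:
# pairs = list(itertools.product(input_list, repeat=2))

# Holds only one pair in memory at a time:
for pair in itertools.product(set(input_list), repeat=2):
    handle(pair)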

Number of distinct contiguous subarray

import math

n = 7  # length of list
k = 2  # target sum
arr = [1, 1, 1, 1, 4, 5, 1]
l = n

def segmentedtree(segmentedtreearr, arr, low, high, pos):  # function to build segment tree
    if low == high:
        segmentedtreearr[pos] = arr[high]
        return
    mid = (low + high) // 2
    segmentedtree(segmentedtreearr, arr, low, mid, (2 * pos) + 1)
    segmentedtree(segmentedtreearr, arr, mid + 1, high, (2 * pos) + 2)
    segmentedtreearr[pos] = segmentedtreearr[(2 * pos) + 1] + segmentedtreearr[(2 * pos) + 2]

flag = int(math.ceil(math.log2(n)))  # calculating height of segment tree
size = 2 * int(math.pow(2, flag)) - 1  # calculating size
segmentedtreearr = [0] * size
low = 0
high = l - 1
pos = 0
segmentedtree(segmentedtreearr, arr, low, high, pos)
if n % 2 == 0:
    print(segmentedtreearr.count(k) + 1)
else:
    print(segmentedtreearr.count(k))
Now arr = [1,1,1,1,4,5,1], so the possible combinations with sum equal to k=2 are [1,1] using indices (0,1), [1,1] using indices (1,2), and [1,1] using indices (2,3), but I am getting 2 as output although I believe my implementation is correct.
Segment trees are good for looking up ranges when you have an absolute point, but in your case you have a relative measure you are looking for (a sum).
Your code misses a pair of ones that sit in two different branches of the tree.
As you can imagine, larger sums could span several branches (like for sum = 7). There is no trivial way to make use of this tree to answer the question.
It is much easier with a simple iteration through the list, using two indexes (the left and right ends of a range): increment the left index when the sum is too large and the right index when it is too small. This assumes that all values in the input list are positive, which is stated in your reference to HackerRank:
def count_segments_with_sum(lst, total):
    i = 0
    count = 0
    for j, v in enumerate(lst):
        total -= v
        while total < 0:
            total += lst[i]
            i += 1
        count += not total
    return count
print(count_segments_with_sum([1,1,1,1,4,5,1], 2)) # -> 3
Here is an O(n) solution discarding the tree approach. It uses accumulate and groupby from itertools and merge from heapq:
It is not very optimized. My focus was on demonstrating the principle and using vectorizable components.
import itertools as it, operator as op, heapq as hq
arr=[1,1,1,1,4,5,1]
k = 2
N = len(arr)
# compute cumulative sum (starting at zero) and again shifted by `-k`
ps = list(it.chain(*(it.accumulate(it.chain((i,), arr), op.add) for i in (0,-k))))
# merge the cumsum and shifted cumsum, do this indirectly (index based); observe that any eligible subsequence will result in a repeated number in the merge
idx = hq.merge(range(N+1), range(N+1, 2*N+2), key=ps.__getitem__)
# use groupby to find repeats
grps = (list(grp) for k, grp in it.groupby(idx, key=ps.__getitem__))
grps = (grp for grp in grps if len(grp) > 1)
grps = [(i, j-N-1) for i, j in grps]
Result:
[(0, 2), (1, 3), (2, 4)]
Some more detailed explanation:
1) We build the sequence ps = {0, arr_0, arr_0 + arr_1, arr_0 + arr_1 + arr_2, ...} of cumulative sums of arr. This is useful because any sum of a contiguous stretch of elements can be written as the difference between two terms of ps.
2) In particular, a contiguous subsequence that sums to k will correspond to a pair of elements of ps whose difference is k. To find those we make a copy of ps and subtract k from each element. We therefore need to find numbers that are in ps and in the shifted ps.
3) Because ps and the shifted ps are sorted (assuming the terms of arr are positive), the numbers that appear in both can be found in O(n) using merge, which puts such pairs next to each other. If I remember correctly, the merge is guaranteed to be stable, so we can rely on the element from ps coming first in any such pair.
4) It remains to find the pairs, which we do using groupby.
5) But wait a minute: if we do this directly, all we get in the end are pairs of equal values. If you just want to count them that's fine, but if we want the actual sublists we have to do the merge indirectly, using the key keyword argument, which works the same way as in sorted.
6) So we create two ranges of indices and use ps.__getitem__ as the key function. Because we have two lists but can only pass one key, we concatenate the lists first; as a consequence, the indices into the first and the second list are unique.
7) The result is a list of indices idx such that ps[idx[0]], ps[idx[1]], ... is sorted (ps in the program is the cumulative sums with the shifted copy already glued on). Using the same key function as before, we can do the groupby indirectly, on idx.
8) We then discard all groups that have only a single element, and for the remaining pairs shift back the second index.
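As a quick sanity check (my own addition), a deliberately quadratic brute force reproduces the three pairs in the same arr[i:j] convention:

arr = [1, 1, 1, 1, 4, 5, 1]
k = 2
# All (i, j) such that arr[i:j] sums to k
brute = [(i, j)
         for i in range(len(arr))
         for j in range(i + 1, len(arr) + 1)
         if sum(arr[i:j]) == k]
print(brute)  # [(0, 2), (1, 3), (2, 4)]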

Cython dictionary / map

I have a list of element/label pairs like this: [(e1, l1), (e2, l2), (e3, l1)]
I have to count how many labels two elements have in common; i.e. in the list above, e1 and e3 have the label l1 in common and thus 1 label in common.
I have this Python implementation:
from collections import defaultdict

def common_count(e_l_list):
    count = defaultdict(int)
    l_list = defaultdict(set)
    for e1, l in e_l_list:
        for e2 in l_list[l]:
            if e1 == e2:
                continue
            elif e1 > e2:
                count[e1, e2] += 1
            else:
                count[e2, e1] += 1
        l_list[l].add(e1)
    return count
It takes a list like the one above and computes a dictionary of element pairs and counts. The result for the list above should be {(e1, e2): 1}.
Now I have to scale this to millions of elements and labels, and I thought Cython would be a good solution to save CPU time and memory. But I can't find docs on how to use maps in Cython.
How would I implement the above in pure Cython?
It can be assumed that all elements and labels are unsigned integers.
Thanks in advance :-)
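Not a full Cython answer, but since the elements are unsigned integers, one common stepping stone is to pack each ordered pair into a single 64-bit key; that keeps the dict keys cheap in pure Python and would map directly onto a C++ unordered_map keyed by uint64_t via Cython's libcpp wrappers. A sketch of the question's function with that change (common_count_packed is a made-up name; it assumes element ids fit in 32 bits):

from collections import defaultdict

def common_count_packed(e_l_list):
    count = defaultdict(int)     # packed pair -> number of shared labels
    by_label = defaultdict(set)  # label -> elements seen with that label
    for e1, l in e_l_list:
        for e2 in by_label[l]:
            if e1 == e2:
                continue
            hi, lo = (e1, e2) if e1 > e2 else (e2, e1)
            # Pack (hi, lo) into one 64-bit int; assumes ids fit in 32 bits
            count[(hi << 32) | lo] += 1
        by_label[l].add(e1)
    return count

print(common_count_packed([(1, 7), (2, 8), (3, 7)]))  # {(3 << 32) | 1: 1}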
I think you are overcomplicating this by creating pairs of elements and storing all common labels as the value, when you can create a dict with the element as the key and a list of all labels associated with that element as the value. When you want to find common labels, convert the lists to sets and intersect them; the resulting set holds the labels the two elements share. The average time of the intersection, checked with ~20000 lists, is roughly 0.006 seconds, i.e. very fast.
I tested this with this code:
from collections import *
import random
import time

l = []
for i in xrange(10000000):
    # With element range 0-10000000 the dictionary creation time increases to ~16 seconds
    l.append((random.randrange(0, 50000), random.randrange(0, 50000)))

start = time.clock()
d = defaultdict(list)
for i in l:               # O(n)
    d[i[0]].append(i[1])  # O(n)
print time.clock() - start

times = []
for i in xrange(10000):
    start = time.clock()
    tmp = set(d[random.randrange(0, 50000)])   # picks a random list of labels
    tmp2 = set(d[random.randrange(0, 50000)])  # not guaranteed to be a different list, but more than likely
    common_elements = tmp.intersection(tmp2)
    times.append(time.clock() - start)
print sum(times) / 100.0
18.6747529999 #creation of list
4.17812619876 #creation of dictionary
0.00633531142994 #intersection
Note: the times do change slightly depending on the number of labels. Also, creating the dict might take too long for your situation, but that is only a one-time operation.
I would also strongly recommend not creating all pairs of elements. If you have 5,000,000 elements and they all share at least one label, which is the worst case, then you are looking at 1.25e+13 pairs or, more bluntly, 12.5 trillion. Stored explicitly, that would be ~1700 terabytes, or ~1.7 petabytes, as the quick check below shows.
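The arithmetic behind that estimate (the ~136 bytes per stored pair is a rough assumption covering the key tuple, the count, and dict overhead):

n = 5000000
pairs = n * (n - 1) // 2   # all unordered pairs
print(pairs)               # 12499997500000, ~1.25e13
print(pairs * 136 / 1e15)  # ~1.7 petabytes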

Obtaining the first and second "column's" from a pair of lists

I have many pairs of lists of variable lengths (5, 4, 6 pairs, etc.) inside a single big list; let's call it LIST. Here are two of the lists inside the big LIST as an example:
[(38.621833, -10.825707),
(38.572191, -10.84311), -----> LIST[0]
(38.580202, -10.860877),
(38.610917, -10.85217),
(38.631526, -10.839338)]
[(38.28152, -10.744559),
(38.246368, -10.744552), -----> LIST[1]
(38.246358, -10.779088),
(38.281515, -10.779096)]
I need to create two separate variables, of which one will hold the first "column" (i.e. LIST[0][0][0], LIST[0][1][0], and so on) of all the pairs in the lists (i.e. 38.621833, 38.572191, etc.), and the second will hold the second "column" (i.e. LIST[0][0][1], LIST[0][1][1], and so on).
So finally I will have two variables (say x, y) that contain all the values of the first and second "columns" of all the lists in LIST.
The problem I face is that the lists are not all the same length!
I tried
x = []
y = []
for i in range(len(LIST)):
    x.append(LIST[i][0][0]) #append all the values of the first numbers
    y.append(LIST[i][1][1]) #append all the values of the second numbers
What I expect:
x = (38.621833,38.572191,38.580202,38.610917,38.631526,38.28152,38.246368,38.246358,38.281515)
y = (-10.825707,-10.84311,-10.860877,-10.85217,-10.839338,-10.744559,-10.744552,-10.779088,-10.779096)
But here, because the lists have different numbers of pairs, my loop stops abruptly partway through.
I know I also need to vary the j in LIST[i][j][0], and j changes with each list. But because each list has a different number of pairs, I don't know how to go about it.
How do I go about doing this?
I would use two simple for loops (it's also generic for LIST being longer than 2):
x = []
y = []
for i in range(len(LIST)):
    for j in LIST[i]:
        x.append(j[0])
        y.append(j[1])
You should transpose the sublists and use itertools.chain to build the two flat sequences:
from itertools import chain
zipped = [list(zip(*sub)) for sub in LIST]
x = list(chain.from_iterable(ele[0] for ele in zipped))
y = list(chain.from_iterable(ele[1] for ele in zipped))
print(x, y)
[38.621833, 38.572191, 38.580202, 38.610917, 38.631526, 38.28152, 38.246368, 38.246358, 38.281515] [-10.825707, -10.84311, -10.860877, -10.85217, -10.839338, -10.744559, -10.744552, -10.779088, -10.779096]
for ele1, ele2 in zip(x, y):
    print(ele1, ele2)
38.621833 -10.825707
38.572191 -10.84311
38.580202 -10.860877
38.610917 -10.85217
38.631526 -10.839338
38.28152 -10.744559
38.246368 -10.744552
38.246358 -10.779088
38.281515 -10.779096
Here you go, tuples as requested.
my = [(38.621833, -10.825707),(38.572191, -10.84311),(38.580202, -10.860877),(38.610917, -10.85217),(38.631526, -10.839338)]
my1 = [(38.28152, -10.744559),(38.246368, -10.744552),(38.246358, -10.779088),(38.281515, -10.779096)]
cols = map(tuple, zip(*my))  # Python 2: map returns a list, so it can be indexed
l1 = cols[0]
l2 = cols[1]
print l1, l2
Output:
(38.621833, 38.572191, 38.580202, 38.610917, 38.631526)(-10.825707, -10.84311, -10.860877, -10.85217, -10.839338)
Use the map function with zip and the * unpacking operator.
l = [(38.621833, -10.825707),
(38.572191, -10.84311),
(38.580202, -10.860877),
(38.610917, -10.85217),
(38.631526, -10.839338)]
cols = map(list, zip(*l))  # Python 2: map returns a list
x = cols[0]
y = cols[1]
print 'x = {},\n y = {}'.format(x, y)
x = [38.621833, 38.572191, 38.580202, 38.610917, 38.631526],
y = [-10.825707, -10.84311, -10.860877, -10.85217, -10.839338]
Or, if you don't want to store it in variables, then don't use indexing in the above solution:
map(list, zip(*l)) # will give you a nested list
Your LIST consists of 2 lists. With
for i in range(len(LIST)):
you run through your loop exactly 2 times.
If you want to solve your problem with for-loops, you need to nest them:
# declare x, y as lists
x = []
y = []
for i_list in LIST:
    # outer for-loop runs 2 times - once for each list appended to LIST
    # 1st run: i_list becomes LIST[0]
    # 2nd run: i_list becomes LIST[1]
    for pair in i_list:
        # inner for-loop runs as often as there are tuples in i_list
        # pair becomes the content of i_list[#run]
        x.append(pair[0])  # adds x-value to x
        y.append(pair[1])  # adds y-value to y
If you prefer working with indexes, use:
for i in range(len(LIST)):
    for j in range(len(LIST[i])):
        x.append(LIST[i][j][0])
        y.append(LIST[i][j][1])
NOT working with indexes for appending the x- and y-values is much easier to write (it saves complex thinking about the list structure and the correct use of indexes) and is much more comprehensible for other people reading your code.
