Definition: Array A = (a1, a2, ..., an) is >= B = (b1, b2, ..., bn) if the two arrays have the same size and a_i >= b_i for every i from 1 to n.
For example:
[1,2,3] >= [1,2,0]
[1,2,0] not comparable with [1,0,2]
[1,0,2] >= [1,0,0]
I have a list containing a large number of such arrays (approx. 10000, but it can be bigger). The arrays' elements are positive integers. I need to remove every array from this list that is bigger than at least one of the other arrays. In other words: if there exists a B such that A >= B, then remove A.
Here is my current O(n^2) approach, which is extremely slow. I simply compare every array with all other arrays and remove it if it's bigger. Are there any ways to speed it up?
import numpy as np
import time
import random

def filter_minimal(lst):
    n = len(lst)
    to_delete = set()
    for i in xrange(n - 1):
        if i in to_delete:
            continue
        for j in xrange(i + 1, n):
            if j in to_delete:
                continue
            if all(lst[i] >= lst[j]):
                # lst[i] dominates lst[j], so lst[i] must be removed
                to_delete.add(i)
                break
            elif all(lst[i] <= lst[j]):
                # lst[j] dominates lst[i], so lst[j] must be removed
                to_delete.add(j)
    return [lst[i] for i in xrange(len(lst)) if i not in to_delete]

def test(number_of_arrays, size):
    x = map(np.array, [[random.randrange(0, 10) for _ in xrange(size)] for i in xrange(number_of_arrays)])
    return filter_minimal(x)

a = time.time()
result = test(400, 10)
print time.time() - a
print len(result)
P.S. I've noticed that using numpy.all instead of the builtin Python all slows the program down dramatically. What could be the reason?
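One way to see the difference on your own machine is a quick timeit sketch along these lines (the array size and repeat count are arbitrary choices, not taken from the post):

import timeit
import numpy as np

x = np.arange(10)
y = np.arange(10)

# builtin all() iterating over the boolean result vs. numpy.all() on it
print(timeit.timeit(lambda: all(x >= y), number=100000))
print(timeit.timeit(lambda: np.all(x >= y), number=100000))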
Might not be exactly what you are asking for, but this should get you started.
import numpy as np

def compare(x, y):
    # Reshape x to a higher-dimensional array
    compare_array = x.reshape(-1, 1, x.shape[-1])
    # You can now compare every x with every y element-wise simultaneously
    mask = (y >= compare_array)
    # Build a mask that first checks that all elements of y are greater than or equal
    # to x, and then checks that this is the case for at least one y.
    mask = np.any(np.all(mask, axis=-1), axis=-1)
    # Apply this mask to x
    return x[mask]

def test(number_of_arrays, size, maxval):
    # Create arrays of shape (number_of_arrays, size) with maximum value maxval.
    x = np.random.randint(maxval, size=(number_of_arrays, size))
    y = np.random.randint(maxval, size=(number_of_arrays, size))
    return compare(x, y)

print test(50, 10, 20)
First of all we need to check the objective carefully. Is it true that we delete any array that is >= ANY of the other arrays, even the deleted ones? For example, if A >= B, C >= A, and B = C, do we need to delete only A, or both A and C? If we only need to delete arrays that dominate a KEPT array, then it is a much harder problem, because different ways of splitting the set into kept and deleted arrays may be valid, so you have the problem of finding the largest valid kept set.
Assuming the easy problem, a better way to state it is that you want to KEEP every array that, for EACH of the other arrays, has at least one element < the corresponding element of that other array. (In the hard problem, "other arrays" means the other KEPT arrays. We will not consider this.)
Stage 1
To solve this problem, arrange the arrays as columns of a matrix and then sort each row, while maintaining the key to each array and the mapping of each array/row to its position (the POSITION lists). For example, you might end up with a Stage 1 result like this:
row 1: B C D A E
row 2: C A E B D
row 3: E D B C A
Meaning that for the first element (row 1) array B has a value >= C, C >= D, etc.
Now, sort and iterate over the last column of this matrix ({E D A} in the example). For each item, check whether its element is less than the previous element in its row. For example, in row 1 you would check whether E < A. If this is true you can stop immediately and keep that array: if E_row1 < A_row1 then you can keep array E. Only if the values in the row are equal do you need to do a Stage 2 test (see below).
In the example shown you would keep E, D, A (as long as they passed the test above).
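As a rough illustration of the Stage 1 bookkeeping, here is a small NumPy sketch (the row/column orientation, the toy data, and the names order/position are my own assumptions, not part of the description above):

import numpy as np

# Each input array is a row of `data`; each "row" of the description above is one
# coordinate, i.e. one column of `data`.
data = np.array([[3, 1, 5],
                 [2, 4, 0],
                 [3, 0, 2],
                 [1, 1, 1]])

# order[i] lists the array indices for coordinate i, from largest value to smallest,
# so order[i][-1] corresponds to the "last column" of the description.
order = np.argsort(-data, axis=0, kind="stable").T

# position[k, i] is the rank of array k within coordinate i (the POSITION lists).
position = np.empty(data.shape, dtype=int)
for i in range(order.shape[0]):
    position[order[i], i] = np.arange(order.shape[1])

print(order)
print(position)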
Stage 2
This leaves B and C. Sort the POSITION list for each. For example, this will tell you that the row with B's minimum position is row 2. Now do a direct comparison between B and every array below it in the minimum row, here row 2. Here there is only one such array, D. Do a direct comparison between B and D. This shows that B < D in row 3, therefore B is compatible with D. If the item is compatible with every array below its minimum position, keep it. We keep B.
Now we do the same thing for C. In C's case we need only do one direct comparison, with A. C dominates A so we do not keep C.
Note that in addition to testing items that did not appear in the last column we need to test items that had equality in Stage 1. For example, imagine D=A=E in row 1. In this case we would have to do direct comparisons for every equality involving the array in the last column. So, in this case we direct compare E to A and E to D. This shows that E dominates D, so E is not kept.
The final result is we keep A, B, and D. C and E are discarded.
The overall performance of this algorithm is O(n^2 log n) for Stage 1, plus somewhere between O(n) (lower bound) and O(n log n) (upper bound) for Stage 2. So the maximum running time is O(n^2 log n + n log n) and the minimum running time is O(n^2 log n + n). Note that the running time of your algorithm is cubic, O(n^3): you compare every pair of arrays (n * n pairs) and each comparison takes n element comparisons, giving n * n * n.
In general, this will be much faster than the brute force approach. Most of the time will be spent sorting the original matrix, a more or less unavoidable task. Note that you could potentially improve my algorithm by using priority queues instead of sorting, but the resulting algorithm would be much more complicated.
Related
I was coding a function in Python to find elements of a sorted list that exist in another sorted list and print out the results:
# assume that both lists are sorted
def compare_sorted_lists(list1, list2):
    res = []
    a = 0
    b = 0
    while a < len(list1) and b < len(list2):
        if list1[a] == list2[b]:
            res.append(list1[a])
            a += 1
        elif list1[a] < list2[b]:
            a += 1
        else:
            b += 1
    return res
I want to figure out the time complexity of comparing elements with this method.
Assuming that:
list1 has length A and the maximum number of digits/letters in a list1 element is X
list2 has length B and the maximum number of digits/letters in a list2 element is Y
For these lists I have O(A+B) time complexity when traversing them with pointers, but how would comparing elements affect the time complexity for this function (specifically, worst-case time complexity)?
The comparison between two elements is constant time, so this does not affect the complexity of your whole algorithm, which you correctly identified as O(A+B).
As user1717828 pointed out, the loop runs at most A+B times; however, comparing two elements is not a constant-time operation. If the numbers were fixed-width it would be, but Python integers are unbounded, so the cost of a comparison grows linearly with the number of digits. Therefore the time complexity of the algorithm you gave is
O((A+B) * max{X,Y})
You can actually do better than that under specific circumstances. E.g. if A << B, then the following code has O(A*log(B)*max{X,Y}) time complexity.
from bisect import bisect_left
res = []
for a in list1:                        # assumes len(list1) << len(list2)
    i = bisect_left(list2, a)          # binary search: repeatedly halve list2
    if i < len(list2) and list2[i] == a:
        res.append(a)
because the inner loop keeps dividing the list B in two, which can take at most log_2(B) + 1 steps.
The Data
I am currently working with very large JSON files formatted as follows:
{key: [1000+ * arrays of length 241],
key2: [1000+ * arrays of length 241],
(...repeat 5-8 times...)}
The data is structured so that the nth element in each key's array belongs to the nth entity. Think of each key as a descriptor such as 'height' or 'pressure'; to get an entity's 'height' and 'pressure', you access the entity's index n in all of the arrays. Therefore all of the keys' arrays have the same length Z.
This, as you can imagine, is a pain to work with as a whole. Therefore, whenever I perform any data manipulation I return a masked array of length Z populated with 1's and 0's: 1 means the data at that index in every key is to be kept, and 0 means it should be omitted.
The Problem
Once all of my data manipulation has been performed, I need to apply the masked array to the data and return a copy of the original JSON data, but where the length of each key's array is equal to the number of 1's in the masked array (if the element of the masked array at index n is 0, then the element at index n is removed from every key's array, and vice versa).
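For instance (toy values of my own, with Z = 3), applying the mask should behave like this:

d    = {"height": [10, 20, 30], "pressure": [1.0, 2.0, 3.0]}
mask = [1, 0, 1]   # keep indices 0 and 2, drop index 1
# expected result: {"height": [10, 30], "pressure": [1.0, 3.0]}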
My attempt
# mask: masked array
# d: data to apply the mask to
def apply_mask(mask, d):
    keys = d.keys()
    print(keys)
    rem = []  # list of indices to remove
    for i in range(len(mask)):
        if mask[i] == 0:
            rem.append(i)  # populate 'rem'
    for k in keys:
        d[k] = [elem for elem in d[k] if not d[k].index(elem) in rem]
    return d
This works as intended, but takes a while on such large JSON data.
Question
I hope everything above was clear and helps you to understand my question:
Is there a more optimal/quicker way to apply a masked array to data such as this shown above?
Cheers
This is going to be slow because
d[k] = [elem for elem in d[k] if not d[k].index(elem) in rem]
is completely recreating the inner list every time.
Since you're already modifying d in-place, you could just delete the respective elements:
def apply_mask(mask, d):
    for i, keep in enumerate(mask):
        if not keep:
            for key in d:
                del d[key][i - len(mask)]
    return d
(Negative indices i - len(mask) are being used because positive indices don't work anymore if the list has already changed its length due to previously removed elements.)
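A tiny illustration of that trick (toy values of my own):

mask = [1, 0, 0, 1]
lst = ['a', 'b', 'c', 'd']
for i, keep in enumerate(mask):
    if not keep:
        del lst[i - len(mask)]   # i=1: del lst[-3] removes 'b'; i=2: del lst[-2] removes 'c'
print(lst)                       # ['a', 'd']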
The problem comes from the high algorithmic complexity of the code. It is possible to design a much faster algorithm.
Let K be the number of keys in the dictionary d (i.e. len(d)). Let Z be the size of the mask (i.e. len(mask)), which is also the typical length of the array values in d (i.e. len(d[key]) for any key).
The algorithmic complexity of the initial code is O(Z^3 * K). This is because rem is a list, so in rem runs in linear time, and because d[k].index(elem) searches for elem in d[k] in linear time too.
The solution proposed by finefoot is faster: the complexity of his code is O(Z^2 * K), because del on a CPython list runs in linear time. However, it is possible to do the computation in linear time, O(K * Z). Here is how:
def apply_mask(mask, d):
    for key in d:
        d[key] = [e for i, e in enumerate(d[key]) if mask[i] != 0]
    return d
This code should be several orders of magnitude faster.
PS: I think the initial algorithm is not correct with respect to the description of the problem: d[k].index(elem) returns the index of the first occurrence of elem, so duplicate values are all treated as if they sat at the first occurrence's index, and items that should be kept can be removed.
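If the values are numeric, a NumPy variant along the same lines is also possible (a sketch, assuming every d[key] fits in an ndarray; apply_mask_np is a made-up name, not from the question):

import numpy as np

def apply_mask_np(mask, d):
    keep = np.asarray(mask, dtype=bool)   # convert the 1/0 mask to booleans
    return {key: np.asarray(values)[keep].tolist() for key, values in d.items()}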
I'll preface this with saying that I'm new to Python, but not new to OOP.
I'm using numpy.where to find the indices in n arrays at which a particular condition is met, specifically if the value in the array is greater than x.
What I want to do is find the indices at which all n arrays meet that condition, so that in each array the element at index y is greater than x:
n0[y] > x
n1[y] > x
n2[y] > x
n3[y] > x
For example, if my arrays after using numpy.where were:
a = [0,1,2,3,4,5,6,7,8,9,10]
b = [0,2,4,6,8,10,12,14,16,18,20]
c = [0,2,3,5,7,11,13,17,19,23]
d = [0,1,2,3,5,8,13,21,34,55]
I want to get the output
[0,2]
I found the function numpy.isin, which seems to do what I want for just two arrays. I don't know how to go about expanding this to more than two arrays and am not sure if it's possible.
Here's the start of my code, in which I generate the indices meeting my criteria:
n = np.empty([0])
n = np.append(n,np.where(sensor[i] > x)[0])
I'm a little stuck. I know I could create a new array with the same number of indices as my original arrays and set the values in it to True or False, but that would not be very efficient, and my original arrays are 25k+ elements long.
To find the intersection of n different arrays, first convert them all to sets. Then it is possible to apply set.intersection(). For the example with a, b, c and d, simply do:
set.intersection(*map(set, [a,b,c,d]))
This will result in a set {0, 2}.
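Since the index arrays come from NumPy anyway, a NumPy-only alternative is functools.reduce with np.intersect1d (a sketch, not part of the answer above), which returns the intersection as a sorted array:

from functools import reduce
import numpy as np

a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
b = np.array([0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
c = np.array([0, 2, 3, 5, 7, 11, 13, 17, 19, 23])
d = np.array([0, 1, 2, 3, 5, 8, 13, 21, 34, 55])

common = reduce(np.intersect1d, [a, b, c, d])   # pairwise intersections, left to right
print(common)                                   # [0 2]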
I'm currently trying to do project Euler problem 18 (https://projecteuler.net/problem=18), using the 'brute force' method to check all possible paths. I've just been trying the smaller, 'model' triangle so far.
I was using list comprehension to create a list of lists where the inner lists would contain the indices for that line, for example:
lst = [[a, b, c, d] for a in [0] for b in [0, 1] for c in [0, 1, 2] for d in [0, 1, 2, 3]
       if b == a or b == a + 1 if c == b or c == b + 1 if d == c or d == c + 1]
This gives me the list of lists I want, namely:
[[0,0,0,0],[0,0,0,1],[0,0,1,1],[0,0,1,2],[0,1,1,1],[0,1,1,2],[0,1,2,2],
[0,1,2,3]]
Note: the if conditions ensure that it only moves to adjacent numbers in the next row of the triangle, so that
lst[i][j] == lst[i][j-1] or lst[i][j] == lst[i][j-1] + 1
After I got to this point, I intended that for each of the inner lists I would take the numbers associated with those indices (so [0,0,0,0] would be 3,7,2,8), sum over them, and in this way get all of the possible sums, then take the maximum of those.
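For example, with the model triangle, that lookup might be done like this (a small sketch; the variable names are arbitrary):

tri = [[3], [7, 4], [2, 4, 6], [8, 5, 9, 3]]
path = [0, 0, 0, 0]                                       # one of the index lists above
values = [tri[row][col] for row, col in enumerate(path)]
print(values)                                             # [3, 7, 2, 8]
print(sum(values))                                        # 20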
The problem is that if I were to scale this up to the big triangle I'd have fifteen 'for's and 'if's in my list comprehension. It seems like there must be an easier way! I'm pretty new to Python so hopefully there's some obvious feature I can make use of that I've missed so far!
What an interesting question! Here is a simple brute force approach, note the use of itertools to generate all the combinations, and then ruling out all the cases where successive row indices differ by more than one.
import itertools
import numpy as np
# Here is the input triangle
tri = np.array([[3],[7,4],[2,4,6],[8,5,9,3]])
indices = np.array([range(len(i)) for i in tri])
# Generate all the possible combinations
indexCombs = list(itertools.product(*indices))
# Generate the difference between indices in successive rows for each combination
diffCombs = [np.array(i[1:]) - np.array(i[:-1]) for i in indexCombs]
# The only combinations that are valid are when successive row indices differ by 1 or 0
validCombs = [indexCombs[i] for i in range(len(indexCombs)) if np.all(diffCombs[i]**2<=1)]
# Now get the actual values from the triangle for each row combination
valueCombs = [[tri[i][j[i]] for i in range(len(tri))] for j in validCombs]
# Find the sum for each combination
sums = np.sum(valueCombs, axis=1)
# Print the information pertaining to the largest sum
print 'Highest sum: {0}'.format(sums.max())
print 'Combination: {0}'.format(valueCombs[sums.argmax()])
print 'Row indices: {0}'.format(indexCombs[sums.argmax()])
The output is:
Highest sum: 23
Combination: [3, 7, 4, 9]
Row indices: (0, 0, 1, 0)
Unfortunately this is hugely computationally intensive, so it won't work with the large triangle, but there are definitely some concepts and tools that you could extend to try to get it to work!
Suppose that we are given arrays A and B of positive integers. A and B contain the same integers (the same number of times), so they are naturally the same length.
Consider a pair of permutations U and V (of A and B respectively) to be valid if U[i] != V[i] for i = 0, 1, ..., len(U) - 1.
We want to find a valid pair of permutations for A and B. However, we want our algorithm to be such that all pairs of valid permutations are equally likely to be returned.
I've been working on this problem today and cannot seem to come up with a sleek solution. Here is my best solution thus far:
import random

def random_valid_permutation(values):
    A = values[:]
    B = values[:]
    while not is_valid_permutation(A, B):
        random.shuffle(A)
        random.shuffle(B)
    return A, B

def is_valid_permutation(A, B):
    return all([A[i] != B[i] for i in range(len(A))])
Unfortunately, since this method involves a random shuffle of each array, it could in theory take infinite time to produce a valid output. I have come up with a couple of alternatives that do run in finite (and reasonable) time, but their implementation is much longer.
Does anyone know of a sleek way to solve this problem?
First note that every permutation, A, has the same number of derangements B as any other permutation A. Thus it is enough to generate a single A and then generate random B until you get a match. The probability that a permutation, B, is a derangement of A is known to be (approximately) 1/e (a little better than 1 out of 3) in a way that is essentially independent of the number of items. There is over a 99% probability that you will find a valid B with less than a dozen trials. Unless your list of values is large, fishing using the built-in random.shuffle might be quicker than rolling your own with the overhead of checking with each new placement of an item if it has led to a clash. The following is almost instantaneous with a thousand elements and still only takes about a second or so with a million elements:
import random

def random_valid_permutation(values):
    A = values[:]
    B = values[:]
    random.shuffle(A)
    random.shuffle(B)
    while not is_valid_permutation(A, B):
        random.shuffle(B)
    return A, B

def is_valid_permutation(A, B):
    return all(A[i] != B[i] for i in range(len(A)))
As an optimization, I removed [ and ] from the definition of is_valid_permutation(), since all() can work directly on a generator expression. There is no reason to create the whole list in memory, since any clash will typically be detected long before the end of the list.
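A quick usage sketch (the example values are my own):

values = [1, 2, 2, 3, 4, 5]
A, B = random_valid_permutation(values)
print(A)
print(B)   # A[i] != B[i] at every position i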