I'm trying to build a heuristic for the simplest feasible Golomb ruler possible: from 0 to n, find n numbers such that all the differences between them are different. The heuristic increments the candidate mark by 1 each time; if a difference already exists in the list, it jumps to the next integer. So the ruler starts as [0,1] with the list of differences [1]. Then we try to add 2 to the ruler, giving [0,1,2], but that is not feasible, since the difference 2-1 = 1 already exists in the list of differences. Then we try to add 3, giving [0,1,3], which is feasible, so the list of differences becomes [1,2,3], and so on. Here's what I've come to so far:
n = 5
positions = list(range(1,n+1))
Pos = []
Dist = []
difs = []
i = 0

while (i < len(positions)):
    if len(Pos)==0:
        Pos.append(0)
        Dist.append(0)
    elif len(Pos)==1:
        Pos.append(1)
        Dist.append(1)
    else:
        postest = Pos + [i] #check feasibility to enter the ruler
        difs = [a-b for a in postest for b in postest if a > b]
        if any (d in difs for d in Dist)==True:
            pass
        else:
            for d in difs:
                Dist.append(d)
            Pos.append(i)
    i += 1
However, I can't get the differences check to work. Any suggestions?
For efficiency I would tend to use a set to store the differences, because sets are good for membership testing, and you don't care about the ordering (possibly until you actually print them out, at which point you can use sorted).
You can use a temporary set to store the differences between the number that you are testing and the numbers you currently have, and then either add it to the existing set, or else discard it if you find any matches. (Note the else block on the for loop, which executes if break was not encountered.)
n = 5
i = 0
vals = []
diffs = set()

while len(vals) < n:
    diffs1 = set()
    for j in reversed(vals):
        diff = i - j
        if diff in diffs:
            break
        diffs1.add(diff)
    else:
        vals.append(i)
        diffs.update(diffs1)
    i += 1

print(vals, sorted(diffs))
The explicit loop over values (rather than the use of any) is to avoid unnecessarily calculating the differences between the candidate number and all the existing values, when most candidate numbers are not successful and the loop can be aborted early after finding the first match.
It would also work for vals to be a set, using add instead of append (although similarly, you would probably want to use sorted when printing it). Here a list is used, and although in principle it does not matter in which order you iterate over it, this code iterates in reverse order so that the smaller differences are tested first, because unusable candidates are likely to be rejected more quickly that way. Testing with n=200, the code ran in about 0.2 seconds with reversed and about 2.1 seconds without; the effect becomes progressively more noticeable as n increases. With n=400, it took 1.7 versus 27 seconds with and without reversed.
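If you want to reproduce the timing comparison, here is a minimal sketch of my own: it simply wraps the loop above in a function with a reverse flag (the function name and the flag are not from the original answer) and times both variants with time.perf_counter.

# Timing sketch: the loop above wrapped in a function with a reverse flag,
# so the two iteration orders can be compared directly.
import time

def golomb(n, reverse=True):
    i, vals, diffs = 0, [], set()
    while len(vals) < n:
        diffs1 = set()
        candidates = reversed(vals) if reverse else vals
        for j in candidates:
            diff = i - j
            if diff in diffs:
                break
            diffs1.add(diff)
        else:  # no break: candidate i is feasible
            vals.append(i)
            diffs.update(diffs1)
        i += 1
    return vals

for reverse in (True, False):
    start = time.perf_counter()
    golomb(200, reverse)
    print("reversed" if reverse else "forward", time.perf_counter() - start)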
I have a list with repeating elements, e.g. orig = [1,1,1,2,2,3].
I want to create a derangement b = f(orig) such that at every location the value in b is different from the value in orig:
b[i] != orig[i], for all i
I know a solution for when all elements in orig are unique, but this is a harder case.
Developing a solution in python, but any language will do.
The not-so-efficient solution is clearly:
import itertools
set([s for s in itertools.permutations(orig) if not any([a == b for a, b in zip(s, orig)])])
A second method and first improvement is using this perm_unique:
[s for s in perm_unique(orig) if not any([a == b for a, b in zip(s, orig)])]
A third method is to use this super quick unique_permutations algorithm.
import copy
[copy.copy(s) for s in unique_permutations(orig) if not any([a == b for a, b in zip(s, orig)])]
In my notebook with %%timeit the initial method takes 841 µs, and we improve to 266 µs and then to 137 µs.
Edit
Couldn't stop thinking about this, so I made a small edit to the second method. I didn't have time to dive into the last method. For an explanation, first see the original post (link above). I only added the check and el != elements[depth], which enforces the derangement condition. With this we arrive at a running time of 50 µs.
from collections import Counter

def derangement_unique(elements):
    list_unique = Counter(elements)
    length_list = len(elements)  # will become depth in the next function
    placeholder = [0]*length_list  # will contain the result
    return derangement_unique_helper(elements, list_unique, placeholder, length_list-1)

def derangement_unique_helper(elements, list_unique, result_list, depth):
    if depth < 0:  # arrived at a solution
        yield tuple(result_list)
    else:
        # consider all elements and how many times they should still occur
        for el, count in list_unique.items():
            # ... still required and not breaking the derangement requirement
            if count > 0 and el != elements[depth]:
                result_list[depth] = el  # assign the element
                list_unique[el] -= 1  # subtract from the count still needed
                # loop over all possible continuations
                for g in derangement_unique_helper(elements, list_unique, result_list, depth-1):
                    yield g
                list_unique[el] += 1

list(derangement_unique(orig))
If your list contains a significant share of duplicates, it might be hard to find a derangement quickly.
In this case you can try a graph approach.
Build a graph from the initial list in which every position is connected to the positions holding non-equal elements (easy for a sorted list).
Then build a perfect matching (if the number of elements is even) or a near-perfect matching (for an odd count, you'll need to find some suitable pair and join the single node to it).
The edges of the matching indicate the swaps that make the derangement; see the sketch below.
The Python library networkx should contain the needed methods.
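A minimal sketch of the even-length case, assuming networkx's max_weight_matching; the graph construction and swapping scheme here are my own illustration of the idea, not a tested solution:

# Positions are nodes; an edge connects two positions holding different
# values; every matched pair of positions is swapped. If the matching is
# perfect (possible for even-length lists that admit a derangement), every
# position ends up holding a value different from its original one.
import networkx as nx

def derange_by_matching(orig):
    n = len(orig)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if orig[i] != orig[j]:
                G.add_edge(i, j)
    # With unweighted edges, max_weight_matching maximizes the number of
    # matched pairs, i.e. it returns a maximum-cardinality matching.
    matching = nx.max_weight_matching(G)
    b = list(orig)
    for i, j in matching:
        b[i], b[j] = b[j], b[i]
    return b

print(derange_by_matching([1, 1, 1, 2, 2, 3]))  # one possible derangement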
I have a list of points that looks like this:
points = [(54592748,54593510),(54592745,54593512), ...]
Many of these points are similar in the sense that points[n][0] is almost equal to points[m][0] AND points[n][1] is almost equal to points[m][1], where 'almost equal' means within whatever integer tolerance I decide.
I would like to filter out all the similar points from the list, keeping just one of them.
Here is my code.
points = [(54592748,54593510),(54592745,54593512),(117628626,117630648),(1354358,1619520),(54592746,54593509)]
md = 10 # max distance allowed between two points
to_compare = points[:] # make a list of items to compare
to_remove = set() # keep track of items to be removed

for point in points:
    to_compare.remove(point) # do not compare with itself
    for other_point in to_compare:
        if abs(point[0]-other_point[0]) <= md and abs(point[1]-other_point[1]) <= md:
            to_remove.add(other_point)

for point in to_remove:
    points.remove(point)
It works...
>>>points
[(54592748, 54593510), (117628626, 117630648), (1354358, 1619520)]
but I am looking for a faster solution since my list is millions items long.
PyPy helped a lot; it sped up the whole process about 6 times, but there is probably a more efficient way to do this in the first place, or is there?
Any help is very welcome.
=======
UPDATE
I have tested some of the answers with the points object you can pickle.load() from here https://mega.nz/#!TVci1KDS!tE5fTnjpPwbvpFTmW1TLsVXDvYHbRF8F7g10KGdOPCs
My code takes 1104 seconds and reduces the list to 96428 points (from 99920).
David's code does the job in 14 seconds! But it misses something: 96431 points left.
Martin's code takes 0.06 seconds!! But it also misses something: 96462 points left.
Any clue about why the results are not the same?
Depending on how accurate you need this to be, the following approach should work well:
points = [(54592748, 54593510), (54592745, 54593512), (117628626, 117630648), (1354358, 1619520), (54592746, 54593509)]
d = 20
hpoints = {((x - (x % d)), (y - (y % d))) : (x, y) for x, y in points}

for x in hpoints.values():
    print(x)
This converts each point into a dictionary key with each x and y coordinate rounded down to a multiple of d. The result is a dictionary holding the coordinates of the last point seen in each grid cell. For the data you have given, this would display the following:
(117628626, 117630648)
(54592746, 54593509)
(1354358, 1619520)
Sorting the list first avoids the inner for loop and thus the n^2 time. I'm not sure if it's practically any quicker though, since I don't have your full data. Try this (it outputs the same as far as I can see from your example points, just ordered).
points = [(54592748,54593510),(54592745,54593512),(117628626,117630648),(1354358,1619520),(54592746,54593509)]
md = 10 # max distance allowed between two points
points.sort()
to_remove = set() # keep track of items to be removed
for i, point in enumerate(points):
    if i == len(points) - 1:
        break
    other_point = points[i+1]
    if abs(point[0]-other_point[0]) <= md and abs(point[1]-other_point[1]) <= md:
        to_remove.add(point)

for point in to_remove:
    points.remove(point)
print(points)
This function for getting unique items from a list (it isn't mine, I found it a while back) only loops over the list once (plus dictionary lookups).
def unique(seq, idfun=None):
    # order preserving
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        # in old Python versions:
        # if seen.has_key(marker)
        # but in new ones:
        if marker in seen: continue
        seen[marker] = 1
        result.append(item)
    return result
The id function will require some cleverness. point[0] is divided by error and floored to an integer. So all point[0]'s such that x*error <= point[0] < (x+1)*error are the same and similarly for point[1].
def id(point):
    error = 4
    x = point[0]//error
    y = point[1]//error
    idValue = str(x)+"//"+str(y)
    return idValue
So these functions will reduce points lying between consecutive multiples of error to the same point. The good news is that it only touches the original list once, plus the dictionary lookups. The bad news is that this id function won't catch, for example, that 15 and 17 should be the same, because 15 reduces to 3 and 17 reduces to 4. It is possible that with some cleverness this issue could be resolved.
[NOTE: I originally used exponents of primes for the idValue, but the exponents would be way too large. If you could make the idValue an int, that would increase lookup speed.]
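For reference, here is a minimal usage sketch of my own combining the two functions on the sample data from the question:

# Filter the sample points with the one-pass unique() above, using id()
# (the answer's name, which shadows the builtin) to bucket nearby points.
points = [(54592748, 54593510), (54592745, 54593512), (117628626, 117630648),
          (1354358, 1619520), (54592746, 54593509)]
filtered = unique(points, idfun=id)
print(filtered)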
I am using Python (and have access to pandas, numpy, scipy).
I have two sets of strings, set A and set B. Each of A and B contains c. 2000 elements (each element being a string). The strings are around 50-100 characters long, comprising up to c. 20 words (these sets may get much larger).
I wish to check if a member of set A is also a member of set B.
Now I am thinking a naive implementation can be visualised as a matrix where members in A and B are compared to one another (e.g. A1 == B1, A1 == B2, A1 == B3 and so on...) and the booleans (0, 1) from the comparison comprise the elements of the matrix.
What is the best way to implement this efficiently?
Two further elaborations:
(i) I am also thinking that for larger sets I may use a Bloom filter (e.g. using PyBloom, pybloomfilter) to hash each string (i.e. I don't mind false positives so much...). Is this a good approach or are there other strategies I should consider?
(ii) I am thinking of including a Levenshtein distance match between strings (which I know can be slow) as I may need fuzzy matches - is there a way of combining this with the approach in (i) or otherwise making it more efficient?
Thanks in advance for any help!
Firstly, 2000 * 100 chars isn't that big; you could use a set directly.
Secondly, if your strings are sorted, there is a quick way (which I found here) to compare them, as follows:
def compare(E1, E2):
    i, j = 0, 0
    I, J = len(E1), len(E2)
    while i < I:
        if j >= J or E1[i] < E2[j]:
            print(E1[i], "is not in E2")
            i += 1
        elif E1[i] == E2[j]:
            print(E1[i], "is in E2")
            i, j = i + 1, j + 1
        else:
            j += 1
It is certainly slower than using a set, but it doesn't need the strings to be held in memory (only two are needed at the same time).
For the Levenshtein part, there is a C module which you can find on PyPI, and which is quite fast.
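For instance, a small sketch assuming the python-Levenshtein package (not part of the original answer); the distance threshold of 2 is an arbitrary illustrative choice:

# Fuzzy comparison of two strings with the C-based Levenshtein module.
import Levenshtein

a = "the quick brown fox"
b = "the quick brown fxo"
if Levenshtein.distance(a, b) <= 2:
    print("fuzzy match:", a, "~", b)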
As mentioned in the comments:
def compare(A, B):
    return list(set(A).intersection(B))
This is a modified version of the function that @michaelmeyer presented in https://stackoverflow.com/a/17264117/362951 - his answer to the question at the top of this page.
The modified version below also works on unsorted data, because the function now does the sorting itself.
This should not be a performance or resource problem in many cases, because Python's sorting is very efficient, and presorting also helps.
Please note that the output is now in sorted order too. It will differ from the original order of the first parameter if that was unsorted.
Otherwise the sorting won't hurt much, even if both data sets are already sorted.
But if you want to suppress the sorting, in case both data sets are known to be sorted in ascending order already, call it like this:
compare(my_data1,my_data2,data_is_sorted=True)
Otherwise:
compare(my_data1,my_data2)
and the function accepts unordered data.
This is the modified version. Only the first two lines were added and a third optional parameter:
def compare(E1, E2, data_is_sorted=False):
    if not data_is_sorted:
        E1 = sorted(E1)
        E2 = sorted(E2)
    i, j = 0, 0
    I, J = len(E1), len(E2)
    while i < I:
        if j >= J or E1[i] < E2[j]:
            print(E1[i], "is not in E2")
            i += 1
        elif E1[i] == E2[j]:
            print(E1[i], "is in E2")
            i, j = i + 1, j + 1
        else:
            j += 1
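For example, with two small unsorted lists (sample data of my own, not from the original answer):

# Usage sketch: unsorted inputs are fine because the function sorts them
# itself unless data_is_sorted=True is passed.
A = ["cherry", "apple", "banana"]
B = ["banana", "apple", "durian"]
compare(A, B)
# prints, in sorted order of A, which elements are and are not in E2 (here B)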
I built a function which finds the longest common substring of two text files, by increasing window size, based on the Rabin–Karp algorithm.
The main function is "find_longest" and the inner functions are "make_hashtable", "extend_fingerprints" and "has_match".
I'm having trouble analyzing the average-case complexity of has_match.
Denote by n1, n2 the lengths of text1, text2, and by l the size of the current "window".
fingers1 and fingers2 are the lists of fingerprints of the substrings.
def has_match(text1, text2, fingers1, fingers2, l, r):
    h = make_hashtable(fingers2, r)
    for i in range(len(fingers1)):
        for j in h[fingers1[i]]:
            if text1[i:i+l] == text2[j:j+l]:
                return text1[i:i+l]
    return None
this is "make_hashtable", here I'm pretty sure that the complexcity is O(n2-l+1):
def make_hashtable(fingers, table_size):
    hash_table = [[] for i in range(table_size)]
    count = 0
    for f in fingers:
        hash_table[f].append(count)
        count += 1
    return hash_table
this is "find_longest", im adding this function despite the fact that i dont need it for the complexity analyzing.
def find_longest(text1, text2, basis=2**8, r=2**17-1):
    match = ''
    l = 0  # initial "window" size
    # fingerprints of "windows" of size 0 - all are 0
    fingers1 = [0]*(len(text1)+1)
    fingers2 = [0]*(len(text2)+1)
    while match != None:  # there was a common substring of len l
        l += 1
        extend_fingerprints(text1, fingers1, l, basis, r)
        extend_fingerprints(text2, fingers2, l, basis, r)
        match = has_match(text1, text2, fingers1, fingers2, l, r)
        print(match)
    return l-1
and this is "extend_fingerprints":
def extend_fingerprints(text, fingers, length, basis=2**8, r=2**17-1):
    count = 0
    for f in fingers:
        if count == len(fingers)-1:
            fingers.pop(len(fingers)-1)
            break
        fingers[count] = (f*basis + ord(text[length-1+count])) % r
        count += 1
I'm torn between these two options:
1. O(n_2-l+1) + O(n_1-l+1)*O(l)
Treat r as a constant while n1, n2 are very large, so a lot of collisions occur in the hash table (say O(1) items in every cell, but always some "false positives").
2. O(n_2-l+1) + O(n_1-l+1) + O(l)
Treat r as optimal for a decent hash function, so there are almost no collisions, which means that if two substrings land in the same cell of the hash table we may assume they are actually the same text?
Personally I lean towards the bolder statement.
Thanks.
I think the answer is O((n_2-l) + l*(n_1-l)).
(n_2-l) represents the complexity of make_hashtable for the second text.
l*(n_1-l) represents the two nested loops, which go through every item in the fingerprints of the first text and perform one comparison of an l-length slice, assuming some constant number m of items sharing the same index in the hash table.
I was messing around with Python trying to practice my sorting algorithms and found out something interesting.
I have three different pieces of data:
x = number of numbers to sort
y = range the numbers are in (all random generated ints)
z = total time taken to sort
When:
x = 100000 and
y = (0,100000) then
z = 0.94182094911 sec
When:
x = 100000 and
y = (0,100) then
z = 12.4218382537 sec
When:
x = 100000 and
y = (0,10) then
z = 110.267447809 sec
Any ideas?
Code:
import time
import random
import sys

#-----Function definitions

def quickSort(array): #random pivot location quicksort. uses extra memory.
    smaller = []
    greater = []
    if len(array) <= 1:
        return array
    pivotVal = array[random.randint(0, len(array)-1)]
    array.remove(pivotVal)
    for items in array:
        if items <= pivotVal:
            smaller.append(items)
        else:
            greater.append(items)
    return concat(quickSort(smaller), pivotVal, quickSort(greater))

def concat(before, pivot, after):
    new = []
    for items in before:
        new.append(items)
    new.append(pivot)
    for things in after:
        new.append(things)
    return new

#-----Variable definitions
list = []
iter = 0
sys.setrecursionlimit(20000)
start = time.clock() #start the clock

#-----Generate the list of numbers to sort
while(iter < 100000):
    list.append(random.randint(0,10)) #modify this to change sorting speed
    iter = iter + 1
timetogenerate = time.clock() - start #current timer - last timer snapshot

#-----Sort the list of numbers
list = quickSort(list)
timetosort = time.clock() - timetogenerate #current timer - last timer snapshot

#-----Write the list of numbers
file = open("C:\output.txt", 'w')
for items in list:
    file.write(str(items))
    file.write("\n")
file.close()
timetowrite = time.clock() - timetosort #current timer - last timer snapshot

#-----Print info
print "time to start: " + str(start)
print "time to generate: " + str(timetogenerate)
print "time to sort: " + str(timetosort)
print "time to write: " + str(timetowrite)

totaltime = timetogenerate + timetosort + start
print "total time: " + str(totaltime)
-------------------revised NEW code----------------------------
def quickSort(array): #random pivot location quicksort. uses extra memory.
    smaller = []
    greater = []
    equal = []
    if len(array) <= 1:
        return array
    pivotVal = array[random.randint(0, len(array)-1)]
    array.remove(pivotVal)
    equal.append(pivotVal)
    for items in array:
        if items < pivotVal:
            smaller.append(items)
        elif items > pivotVal:
            greater.append(items)
        else:
            equal.append(items)
    return concat(quickSort(smaller), equal, quickSort(greater))

def concat(before, equal, after):
    new = []
    for items in before:
        new.append(items)
    for items in equal:
        new.append(items)
    for items in after:
        new.append(items)
    return new
I think this has to do with the choice of a pivot. Depending on how your partition step works, if you have a lot of duplicate values, your algorithm can degenerate to quadratic behavior when confronted with many duplicates. For example, suppose that you're trying to quicksort this stream:
[0 0 0 0 0 0 0 0 0 0 0 0 0]
If you aren't careful with how you do the partitioning step, this can degenerate quickly. For example, suppose you pick your pivot as the first 0, leaving you with the array
[0 0 0 0 0 0 0 0 0 0 0 0]
to partition. Your algorithm might say that the smaller values are the array
[0 0 0 0 0 0 0 0 0 0 0 0]
And the larger values are the array
[]
This is the case that causes quicksort to degenerate to O(n^2), since each recursive call only shrinks the size of the input by one (namely, by pulling off the pivot element).
I noticed that in your code, your partitioning step does indeed do this:
for items in array:
    if items <= pivotVal:
        smaller.append(items)
    else:
        greater.append(items)
Given a stream that's a whole bunch of copies of the same element, this will put all of them into one array to recursively sort.
Of course, this seems like a ridiculous case - how is this at all connected to reducing the number of values in the array? - but it actually does come up when you're sorting lots of elements that aren't distinct. In particular, after a few passes of the partitioning, you're likely to group together all equal elements, which will bring you into this case.
For a discussion of how to prevent this from happening, there's a really great talk by Bob Sedgewick and Jon Bentley about how to modify the partition step to work quickly when in the presence of duplicate elements. It's connected to Dijkstra's Dutch national flag problem, and their solutions are really clever.
One option that works is to partition the input into three groups - less, equal, and greater. Once you've broken the input up this way, you only need to sort the less and greater groups; the equal groups are already sorted. The above link to the talk shows how to do this more or less in-place, but since you're already using an out-of-place quicksort the fix should be easy. Here's my attempt at it:
for items in array:
    if items < pivotVal:
        smaller.append(items)
    elif items == pivotVal:
        equal.append(items)
    else:
        greater.append(items)
I've never written a line of Python in my life, BTW, so this may be totally illegal syntax. But I hope the idea is clear! :-)
Things we know:
Time complexity for quick sort of unordered array is O(n*logn).
If the array is already sorted, it degrades to O(n^2).
The first two statements are not discrete, i.e. the closer an array is to being sorted, the closer quicksort's time complexity is to O(n^2), and conversely, as we shuffle it, the complexity approaches O(n*logn).
Now, let's look at your experiment:
In all three cases you used the same number of elements. So, our n which you named x is always 100000.
In your first experiment, you used numbers between 0 and 100000, so ideally with a perfect random number generator you'd get mostly different numbers in a relatively unordered list, thus fitting the O(n*logn) complexity case.
In your third experiment, you used numbers between 0 and 10 in a list of 100000 elements. That means there were quite a lot of duplicates in your list, making it a lot closer to a sorted list than in the first experiment. So, in that case, the time complexity was much closer to O(n^2).
And for the same, large enough n, n^2 > n*logn, which your experiment actually confirmed.
The quicksort algorithm has a known weakness: it is slower when the data is mostly sorted. When you have 100000 numbers between 0 and 10, they will be closer to being 'mostly sorted' than 100000 numbers in the range 0 to 100000.