Quicksort sorts larger numbers faster?

I was messing around with Python trying to practice my sorting algorithms and found out something interesting.
I have three different pieces of data:
x = number of numbers to sort
y = range the numbers are in (all random generated ints)
z = total time taken to sort
When:
x = 100000 and
y = (0,100000) then
z = 0.94182094911 sec
When:
x = 100000 and
y = (0,100) then
z = 12.4218382537 sec
When:
x = 100000 and
y = (0,10) then
z = 110.267447809 sec
Any ideas?
Code:
import time
import random
import sys
#-----Function definitions
def quickSort(array): #random pivot location quicksort. uses extra memory.
    smaller = []
    greater = []
    if len(array) <= 1:
        return array
    pivotVal = array[random.randint(0, len(array)-1)]
    array.remove(pivotVal)
    for items in array:
        if items <= pivotVal:
            smaller.append(items)
        else:
            greater.append(items)
    return concat(quickSort(smaller), pivotVal, quickSort(greater))
def concat(before, pivot, after):
    new = []
    for items in before:
        new.append(items)
    new.append(pivot)
    for things in after:
        new.append(things)
    return new
#-----Variable definitions
list = []
iter = 0
sys.setrecursionlimit(20000)
start = time.clock() #start the clock
#-----Generate the list of numbers to sort
while(iter < 100000):
    list.append(random.randint(0,10)) #modify this to change sorting speed
    iter = iter + 1
timetogenerate = time.clock() - start #current timer - last timer snapshot
#-----Sort the list of numbers
list = quickSort(list)
timetosort = time.clock() - timetogenerate #current timer - last timer snapshot
#-----Write the list of numbers
file = open("C:\output.txt", 'w')
for items in list:
    file.write(str(items))
    file.write("\n")
file.close()
timetowrite = time.clock() - timetosort #current timer - last timer snapshot
#-----Print info
print "time to start: " + str(start)
print "time to generate: " + str(timetogenerate)
print "time to sort: " + str(timetosort)
print "time to write: " + str(timetowrite)
totaltime = timetogenerate + timetosort + start
print "total time: " + str(totaltime)
-------------------revised NEW code----------------------------
def quickSort(array): #random pivot location quicksort. uses extra memory.
    smaller = []
    greater = []
    equal = []
    if len(array) <= 1:
        return array
    pivotVal = array[random.randint(0, len(array)-1)]
    array.remove(pivotVal)
    equal.append(pivotVal)
    for items in array:
        if items < pivotVal:
            smaller.append(items)
        elif items > pivotVal:
            greater.append(items)
        else:
            equal.append(items)
    return concat(quickSort(smaller), equal, quickSort(greater))
def concat(before, equal, after):
    new = []
    for items in before:
        new.append(items)
    for items in equal:
        new.append(items)
    for items in after:
        new.append(items)
    return new

I think this has to do with the choice of a pivot and with how the partition step treats values equal to it. Depending on how your partition step works, your algorithm can degenerate to quadratic behavior when the input contains many duplicate values. For example, suppose that you're trying to quicksort this stream:
[0 0 0 0 0 0 0 0 0 0 0 0 0]
If you aren't careful with how you do the partitioning step, this can degenerate quickly. For example, suppose you pick your pivot as the first 0, leaving you with the array
[0 0 0 0 0 0 0 0 0 0 0 0]
to partition. Your algorithm might say that the smaller values are the array
[0 0 0 0 0 0 0 0 0 0 0 0]
And the larger values are the array
[]
This is the case that causes quicksort to degenerate to O(n^2), since each recursive call is only shrinking the size of the input by one (namely, by pulling off the pivot element).
I noticed that in your code, your partitioning step does indeed do this:
for items in array:
    if items <= pivotVal:
        smaller.append(items)
    else:
        greater.append(items)
Given a stream that's a whole bunch of copies of the same element, this will put all of them into one array to recursively sort.
Of course, this seems like a ridiculous case - how is this at all connected to reducing the number of values in the array? - but it actually does come up when you're sorting lots of elements that aren't distinct. In particular, after a few passes of the partitioning, you're likely to group together all equal elements, which will bring you into this case.
For a discussion of how to prevent this from happening, there's a really great talk by Bob Sedgewick and Jon Bentley about how to modify the partition step to work quickly when in the presence of duplicate elements. It's connected to Dijkstra's Dutch national flag problem, and their solutions are really clever.
One option that works is to partition the input into three groups - less, equal, and greater. Once you've broken the input up this way, you only need to sort the less and greater groups; the equal groups are already sorted. The above link to the talk shows how to do this more or less in-place, but since you're already using an out-of-place quicksort the fix should be easy. Here's my attempt at it:
for items in array:
    if items < pivotVal:
        smaller.append(items)
    elif items == pivotVal:
        equal.append(items)
    else:
        greater.append(items)
I've never written a line of Python in my life, BTW, so this may be totally illegal syntax. But I hope the idea is clear! :-)
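For the in-place variant mentioned above, here is a rough sketch of what a three-way (Dutch national flag style) partition could look like. This is not the code from the talk, just a minimal version under a few assumptions: a random pivot, Dijkstra-style partitioning into less/equal/greater regions, and a made-up function name `quicksort_3way`.

```python
import random

def quicksort_3way(a, lo=0, hi=None):
    """Sort list `a` in place using three-way partitioning (hypothetical helper)."""
    if hi is None:
        hi = len(a) - 1
    while lo < hi:
        pivot = a[random.randint(lo, hi)]
        # Invariant: a[lo:lt] < pivot, a[lt:i] == pivot, a[gt+1:hi+1] > pivot
        lt, i, gt = lo, lo, hi
        while i <= gt:
            if a[i] < pivot:
                a[lt], a[i] = a[i], a[lt]
                lt += 1
                i += 1
            elif a[i] > pivot:
                a[i], a[gt] = a[gt], a[i]
                gt -= 1
            else:
                i += 1
        # Everything equal to the pivot is already in its final place.
        # Recurse into the smaller side and loop on the larger one,
        # which keeps the recursion depth logarithmic.
        if lt - lo < hi - gt:
            quicksort_3way(a, lo, lt - 1)
            lo = gt + 1
        else:
            quicksort_3way(a, gt + 1, hi)
            hi = lt - 1

# Many duplicates, like the slow case in the question.
data = [random.randint(0, 10) for _ in range(100000)]
quicksort_3way(data)
```

Because the block of elements equal to the pivot is never touched again, an input with only a handful of distinct values finishes after a handful of partitioning levels instead of degrading to quadratic time.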

Things we know:
Time complexity for quick sort of unordered array is O(n*logn).
If the array is already sorted and the pivot is chosen naively (for example, always the first element), it degrades to O(n^2).
These two cases are not a strict either/or: the closer an array is to being sorted, the closer the time complexity of quicksort gets to O(n^2), and conversely, the more we shuffle it, the closer the complexity gets to O(n*logn).
Now, let's look at your experiment:
In all three cases you used the same number of elements. So, our n which you named x is always 100000.
In your first experiment, you used numbers between 0 and 100000, so ideally with a perfect random number generator you'd get mostly different numbers in a relatively unordered list, thus fitting the O(n*logn) complexity case.
In your third experiment, you used numbers between 0 and 10 in a list of 100000 elements. That means there were a great many duplicates in your list, making it a lot closer to a sorted list than in the first experiment. So, in that case time complexity was much closer to O(n^2).
And for the same large enough n, n*logn < n^2, which you actually confirmed by your experiment.

The quicksort algorithm has a known weakness: it is slower when the data is mostly sorted. When you have 100000 numbers between 0 and 10, they will be closer to being 'mostly sorted' than 100000 numbers in the range of 0 to 100000.

Related

Why running time of better algorithm code is more than primitive algorithm code?

Problem description : Write a function that takes in a non-empty array of integers that are sorted in ascending order and returns a new array of the same length with the squares of the original integers also sorted in ascending order.
I have written two Python functions to solve the problem. One is 'sortedSquaredArrayNormal' and the other is 'sortedSquaredArrayBetter'. The first one has O(nlogn) time complexity and the second has O(n) time complexity, I believe. I have also written a third function, 'test_runtime_compare', that prints each function's run time. Below is my code:
import random
import time
def sortedSquaredArrayNormal(array):
    square_arr = []
    for elem in array:
        square_arr.append(elem*elem)
    square_arr.sort()
    return square_arr
def sortedSquaredArrayBetter(array):
    big_index = len(array)-1
    small_index = 0
    output_arr = [0 for elem in array]
    # elements of bigger indices inserted first in output array
    for idx in range(len(array)-1, -1, -1):
        small_elem = array[small_index]
        big_elem = array[big_index]
        if(abs(small_elem) > abs(big_elem)):
            output_arr[idx] = small_elem * small_elem
            small_index += 1 # small index is shifted 1 position to right
        else:
            output_arr[idx] = big_elem * big_elem
            big_index -= 1 # big index is shifted 1 position to left
    return output_arr
def test_runtime_compare():
    new_arr = [random.randrange(-100, 100) for i in range(100000)]
    new_arr.sort()
    initial = time.time()
    dummy = sortedSquaredArrayNormal(new_arr)
    final = time.time()
    normal_time = final - initial
    print('Normal time: {}'.format(normal_time))
    time.sleep(5)
    initial = time.time()
    new = sortedSquaredArrayBetter(new_arr)
    final = time.time()
    better_time = final - initial
    print('Better time: {}'.format(better_time))
test_runtime_compare()
I got the output:
Normal time: 0.03777050971984863
Better time: 0.11590099334716797
I was expecting 'better time' to be smaller than 'normal time'. But every time I run the code on my machine with a larger input array, I get 'normal time' less than 'better time'. I can't find the cause. Can anyone help me understand it? Did I make a mistake in the complexity analysis?

take in non-empty array of integers that are sorted in ascending order
and returns a new array of the same length with the squares of the
original integers also sorted in ascending order.
That sounds like a very trivial issue, were it not for negative numbers.
For example, for [-3,-2,-1,0,1,2,3], you would have to sort [9,4,1,0,1,4,9].
The order of the squared numbers is very predictable, hence you found an algorithm that does it in O(n).
But maybe the built-in algorithm for sort is also very good at sorting these kinds of very monotone sequences, so it can also do it in O(n) instead of O(n * log(n)), which is for completely random sequences.
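One way to check that guess is to count how many comparisons the built-in sort actually performs. CPython's list sort (Timsort) looks for already-ordered runs, so a "descending then ascending" input like the squared array should need far fewer comparisons than a shuffled copy. The sketch below is only an illustration: the `Counted` wrapper and `count_comparisons` helper are made up for this purpose, and distinct values are used to keep the runs clean.

```python
import random

class Counted:
    """Wraps a value and counts every `<` comparison the sort makes."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value

def count_comparisons(values):
    """Sort a wrapped copy of `values` and return the number of comparisons."""
    Counted.comparisons = 0
    sorted(Counted(v) for v in values)
    return Counted.comparisons

# Squares of a sorted array: one descending run followed by one ascending run.
squares = [x * x for x in range(-5000, 5000)]
shuffled = squares[:]
random.shuffle(shuffled)

structured_count = count_comparisons(squares)
shuffled_count = count_comparisons(shuffled)
print("structured input:", structured_count, "comparisons")
print("shuffled input:  ", shuffled_count, "comparisons")
```

If the structured count grows roughly linearly with the input size while the shuffled one grows like n*log(n), then both functions in the question are effectively O(n) on this kind of input, and the difference in run time comes down to the built-in sort running in C while the "better" function runs a Python-level loop.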

recursion vs iteration time complexity

Could anyone explain exactly what's happening under the hood to make the recursive approach in the following problem much faster and efficient in terms of time complexity?
The problem: Write a program that would take an array of integers as input and return the largest three numbers sorted in an array, without sorting the original (input) array.
For example:
Input: [22, 5, 3, 1, 8, 2]
Output: [5, 8, 22]
Even though we can simply sort the original array and return the last three elements, that would take at least O(nlog(n)) time as the fastest sorting algorithm would do just that. So the challenge is to perform better and complete the task in O(n) time.
So I was able to come up with a recursive solution:
def findThreeLargestNumbers(array, largest=[]):
    if len(largest) == 3:
        return largest
    max = array[0]
    for i in array:
        if i > max:
            max = i
    array.remove(max)
    largest.insert(0, max)
    return findThreeLargestNumbers(array, largest)
In which I kept finding the largest number, removing it from the original array, appending it to my empty array, and recursively calling the function again until there are three elements in my array.
However, when I looked at the suggested iterative method, I composed this code:
def findThreeLargestNumbers(array):
    sortedLargest = [None, None, None]
    for num in array:
        check(num, sortedLargest)
    return sortedLargest
def check(num, sortedLargest):
    for i in reversed(range(len(sortedLargest))):
        if sortedLargest[i] is None:
            sortedLargest[i] = num
            return
        if num > sortedLargest[i]:
            shift(sortedLargest, i, num)
            return
def shift(array, idx, element):
    if idx == 0:
        array[0] = element
        return array
    array[0] = array[1]
    array[idx-1] = array[idx]
    array[idx] = element
    return array
Both codes passed successfully all the tests and I was convinced that the iterative approach is faster (even though not as clean..). However, I imported the time module and put the codes to the test by providing an array of one million random integers and calculating how long each solution would take to return back the sorted array of the largest three numbers.
The recursive approach was way much faster (about 9 times faster) than the iterative approach!
Why is that? The recursive approach traverses the huge array three times and, on top of that, removes an element each time (which takes O(n) time, as the remaining elements need to be shifted in memory), whereas the iterative approach traverses the input array only once. Yes, it does some operations at every iteration, but on a negligible array of size 3 that should take hardly any time at all!
I really want to be able to judge and pick the most efficient algorithm for any given problem so any explanation would tremendously help.

Advice for optimization.
Avoid function calls. Avoid creating temporary garbage. Avoid extra comparisons. Have logic that looks at elements as little as possible. Walk through how your code works by hand and look at how many steps it takes.
Your recursive code makes only 3 function calls, and as pointed out elsewhere does an average of 1.5 comparisons per element per call. (1 while looking for the max, 0.5 while figuring out where to remove the element.)
Your iterative code makes lots of comparisons per element, calls excess functions, and makes calls to things like reversed and range that create/destroy junk.
Now compare with this iterative solution:
def find_largest(array, limit=3):
    if len(array) <= limit:
        # Special logic not needed.
        return sorted(array)
    else:
        # Initialize the answer to values that will be replaced.
        min_val = min(array[0:limit])
        answer = [min_val for _ in range(limit)]
        # Now scan for the largest values.
        for i in array:
            if answer[0] < i:
                # Sift elements down until we find the right spot.
                j = 1
                while j < limit and answer[j] < i:
                    answer[j-1] = answer[j]
                    j = j+1
                # Now insert.
                answer[j-1] = i
        return answer
There are no function calls. It is possible that you can make up to 6 comparisons per element (verify that answer[0] < i, verify that (j=1) < 3, verify that answer[1] < i, verify that (j=2) < 3, verify that answer[2] < i, then find that (j=3) < 3 is not true). You will hit that worst case if array is sorted. But most of the time you only do the first comparison then move to the next element. No muss, no fuss.
How does it benchmark?
Note that if you wanted the largest 100 elements, then you'd find it worthwhile to use a smarter data structure such as a heap to avoid the bubble sort.
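For reference, a minimal sketch of the heap idea using the standard library's heapq module. The function name `find_largest_heap` is made up here; the approach keeps a min-heap of the `limit` largest values seen so far, so each element costs at most O(log limit).

```python
import heapq

def find_largest_heap(array, limit=3):
    """Return the `limit` largest values of `array` in ascending order."""
    if limit <= 0:
        return []
    heap = []  # min-heap: heap[0] is the smallest of the current candidates
    for value in array:
        if len(heap) < limit:
            heapq.heappush(heap, value)
        elif value > heap[0]:
            # Drop the smallest candidate and add the new value in one step.
            heapq.heapreplace(heap, value)
    return sorted(heap)

print(find_largest_heap([22, 5, 3, 1, 8, 2]))
```

The standard library also offers heapq.nlargest(limit, array), which returns the same values in descending order and is probably the simplest choice when you don't need control over the loop.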

I am not really comfortable with Python, but I have a different approach to the problem, for what it's worth.
As far as I saw, all solutions posted are O(NM), where N is the length of the array and M the length of the largest-elements array.
Because of your specific situation, where N >> M, you could say it's O(N), but the longer the inputs get, the more it will be O(NM).
I agree with @zvone that you seem to have more steps in the iterative solution, which sounds like a valid explanation for your different computing speeds.
Back to my proposal, which implements binary search, O(N*logM), with recursion:
import math
def binarySearch(arr, target, origin = 0):
    """
    Recursive binary search
    Args:
        arr (list): List of numbers to search in
        target (int): Number to search with
    Returns:
        int: index + 1 from immediate lower element to target in arr or -1 if already present or lower than the lowest in arr
    """
    half = math.floor((len(arr) - 1) / 2)
    if target > arr[-1]:
        return origin + len(arr)
    if len(arr) == 1 or target < arr[0]:
        return -1
    if arr[half] < target and arr[half+1] > target:
        return origin + half + 1
    if arr[half] == target or arr[half+1] == target:
        return -1
    if arr[half] < target:
        return binarySearch(arr[half:], target, origin + half)
    if arr[half] > target:
        return binarySearch(arr[:half + 1], target, origin)
def findLargestNumbers(array, limit = 3, result = []):
    """
    Recursive linear search of the largest values in an array
    Args:
        array (list): Array of numbers to search in
        limit (int): Length of array returned. Default: 3
    Returns:
        list: Array of max values with length as limit
    """
    if len(result) == 0:
        result = [float('-inf')] * limit
    if len(array) < 1:
        return result
    val = array[-1]
    foundIndex = binarySearch(result, val)
    if foundIndex != -1:
        result.insert(foundIndex, val)
        return findLargestNumbers(array[:-1],limit, result[1:])
    return findLargestNumbers(array[:-1], limit,result)
It is quite flexible and might be inspiration for a more elaborated answer.

The recursive solution
The recursive function goes through the list 3 times to find the largest number and removes the largest number from the list 3 times.
for i in array:
    if i > max:
        ...
and
array.remove(max)
So, you have 3×N comparisons, plus 3x removal. I guess the removal is optimized in C, but there is again about 3×(N/2) comparisons to find the item to be removed.
So, a total of approximately 4.5 × N comparisons.
The other solution
The other solution goes through the list only once, but each time it compares to the three elements in sortedLargest:
for i in reversed(range(len(sortedLargest))):
...
and almost each time it sorts the sortedLargest with these three assignments:
array[0] = array[1]
array[idx-1] = array[idx]
array[idx] = element
So, you are N times:
calling check
creating and reversing a range(3)
accessing sortedLargest[i]
comparing num > sortedLargest[i]
calling shift
comparing idx == 0
and about 2×N/3 times doing:
array[0] = array[1]
array[idx-1] = array[idx]
array[idx] = element
and N/3 times array[0] = element
It is difficult to count, but that is much more than 4.5×N comparisons.

Check if differences between elements already exists in a list

I'm trying to build a heuristic for the simplest possible feasible Golomb ruler. From 0 to n, find n numbers such that all the differences between them are distinct. This heuristic consists of incrementing the ruler by 1 every time. If a difference already exists in the list, jump to the next integer. So the ruler starts with [0,1] and the list of differences = [ 1 ]. Then we try to add 2 to the ruler [0,1,2], but it's not feasible, since the difference (2-1 = 1) already exists in the list of differences. Then we try to add 3 to the ruler [0,1,3] and it is feasible, and thus the list of differences becomes [1,2,3] and so on. Here's what I've come to so far:
n = 5
positions = list(range(1,n+1))
Pos = []
Dist = []
difs = []
i = 0
while (i < len(positions)):
    if len(Pos)==0:
        Pos.append(0)
        Dist.append(0)
    elif len(Pos)==1:
        Pos.append(1)
        Dist.append(1)
    else:
        postest = Pos + [i] #check feasibility to enter the ruler
        difs = [a-b for a in postest for b in postest if a > b]
        if any (d in difs for d in Dist)==True:
            pass
        else:
            for d in difs:
                Dist.append(d)
            Pos.append(i)
    i += 1
However I can't make the differences check to work. Any suggestions?

For efficiency I would tend to use a set to store the differences, because they are good for inclusion testing, and you don't care about the ordering (possibly until you actually print them out, at which point you can use sorted).
You can use a temporary set to store the differences between the number that you are testing and the numbers you currently have, and then either add it to the existing set, or else discard it if you find any matches. (Note else block on for loop, that will execute if break was not encountered.)
n = 5
i = 0
vals = []
diffs = set()
while len(vals) < n:
    diffs1 = set()
    for j in reversed(vals):
        diff = i - j
        if diff in diffs:
            break
        diffs1.add(diff)
    else:
        vals.append(i)
        diffs.update(diffs1)
    i += 1
print(vals, sorted(diffs))
The explicit loop over values (rather than the use of any) is to avoid unnecessarily calculating the differences between the candidate number and all the existing values, when most candidate numbers are not successful and the loop can be aborted early after finding the first match.
It would work for vals also to be a set and use add instead of append (although similarly, you would probably want to use sorted when printing it). In this case a list is used, and although it does not matter in principle in which order you iterate over it, this code is iterating in reverse order to test the smaller differences first, because the likelihood is that unusable candidates are rejected more quickly this way. Testing it with n=200, the code ran in about 0.2 seconds with reversed and about 2.1 without reversed; the effect is progressively more noticeable as n increases. With n=400, it took 1.7 versus 27 seconds with and without the reversed.

Better python logic that prevent time out when comparing arrays in nested loops

I was attempting to solve a programming challenge, and the program I wrote solved the small test data correctly for this question. But when they ran it against the larger datasets, my program timed out on some occasions. I am mostly a self-taught programmer; if there is a better algorithm/implementation than my logic, can you tell me? Thanks.
Question
Given an array of integers, a, return the maximum difference of any
pair of numbers such that the larger integer in the pair occurs at a
higher index (in the array) than the smaller integer. Return -1 if you
cannot find a pair that satisfies this condition.
My Python Function
def maxDifference( a):
    diff=0
    find=0
    leng = len(a)
    for x in range(0,leng-1):
        for y in range(x+1,leng):
            if(a[y]-a[x]>=diff):
                diff=a[y]-a[x]
                find=1
    if find==1:
        return diff
    else:
        return -1
Constraints:
1 <= N <= 1,000,000
-1,000,000 <= a[i] <= 1,000,000 i belongs to [1,N]
Sample Input:
Array { 2,3,10,2,4,8,1}
Sample Output:
8

Well... since you don't care for anything else than finding the highest number following the lowest number, provided that difference is the highest so far, there's no reason to do several passes or using max() over a slice of the array:
def f1(a):
    smallest = a[0]
    result = 0
    for b in a:
        if b < smallest:
            smallest = b
        if b - smallest > result:
            result = b - smallest
    return result if result > 0 else -1
Thanks @Matthew for the testing code :)
This is very fast even on large sets:
The maximum difference is 99613 99613 99613
Time taken by Sojan's method: 0.0480000972748
Time taken by @Matthews's method: 0.0130000114441
Time taken by @GCord's method: 0.000999927520752

The reason your program takes too long is that your nested loop inherently means quadratic time.
The outer loop goes through N-1 indices. The inner loop goes through a different number of indices each time, but the average is N/2. So, the total number of times through the inner loop is (N-1) * N/2, which is O(N^2). For the maximum N=1000000, that means 499999500000 iterations. That's going to take a long time.
The trick is to find a way to do this in linear time.
Here's one solution (as a vague description, rather than actual code, so someone can't just copy and paste it when they face the same test as you):
Make a list of the smallest value before each index. Each one is just min(smallest_values[-1], arr[i]), and obviously you can do this in N steps.
Make a list of the largest value after each index. The simplest way to do this is to reverse the list, do the exact same loop as above (but with max instead of min), then reverse again. (Reversing a list takes N steps, of course.)
Now, for each element in the list, instead of comparing to every other element, you just have to compare to smallest_values[i] and largest_values[i]. Since you're only doing 2 comparisons for each of the N values, this takes 2N time.
So, even being lazy and naive, that's a total of N + 3N + 2N steps, which is O(N). If N=1000000, that means 6000000 steps, which is a whole lot faster than 499999500000.
You can obviously see how to remove the two reverses, and how to skip the first and last comparisons. If you're smart, you can see how to take the whole largest_values out of the equation entirely. Ultimately, I think you can get it down to 2N - 3 steps, or 1999997. But that's all just a small constant improvement; nowhere near as important as fixing the basic algorithmic problem. You'd probably get a bigger improvement than 3x (maybe 20x), for less work, by just running the naive code in PyPy instead of CPython, or by converting to NumPy—but you're not going to get the 83333x improvement in any way other than changing the algorithm.
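For readers who want to see the shape of the "smallest before / largest after" idea, here is one possible sketch. It is only an illustration, not the answerer's code: the function name `max_difference_linear` is made up, itertools.accumulate is used for the running min and max, and it assumes that a pair of equal values does not count as a valid pair.

```python
from itertools import accumulate

def max_difference_linear(arr):
    """Largest arr[y] - arr[x] with x < y and arr[y] > arr[x], or -1 if none."""
    if len(arr) < 2:
        return -1
    # smallest_before[i] == min(arr[:i + 1])
    smallest_before = list(accumulate(arr, min))
    # largest_after[i] == max(arr[i:]), built by scanning from the right
    largest_after = list(accumulate(reversed(arr), max))[::-1]
    # Split the list between index i and i + 1: the best pair across that
    # split is (smallest value on the left, largest value on the right).
    best = max(largest_after[i + 1] - smallest_before[i]
               for i in range(len(arr) - 1))
    return best if best > 0 else -1

print(max_difference_linear([2, 3, 10, 2, 4, 8, 1]))
```

Each of the three passes is a single linear scan, so the whole thing stays O(N) regardless of how the values are arranged.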

Here's a linear time solution. It keeps a track of the minimum value before each index of the list. These minimum values are stored in a list min_lst. Finally, the difference between corresponding elements of the original and the min list is calculated into another list of differences by zipping the two. The maximum value in this differences list should be the required answer.
def get_max_diff(lst):
    min_lst = []
    running_min = lst[0]
    for item in lst:
        if item < running_min:
            running_min = item
        min_lst.append(running_min)
    val = max(x-y for (x, y) in zip(lst, min_lst))
    if not val:
        return -1
    return val
>>> get_max_diff([5, 6, 2, 12, 8, 15])
13
>>> get_max_diff([2, 3, 10, 2, 4, 8, 1])
8
>>> get_max_diff([5, 4, 3, 2, 1])
-1

Well, I figure since someone in the same problem can copy your code and run with that, I won't lose any sleep over them copying some more optimized code:
import time
import random
def max_difference1(a):
    # your function
def max_difference2(a):
    diff = 0
    for i in range(0, len(a)-1):
        curr_diff = max(a[i+1:]) - a[i]
        diff = max(curr_diff, diff)
    return diff if diff != 0 else -1
my_randoms = random.sample(range(100000), 1000)
t01 = time.time()
max_dif1 = max_difference1(my_randoms)
dt1 = time.time() - t01
t02 = time.time()
max_dif2 = max_difference2(my_randoms)
dt2 = time.time() - t02
print("The maximum difference is", max_dif1)
print("Time taken by your method:", dt1)
print("Time taken by my method:", dt2)
print("My method is", dt1/dt2, "times faster.")
The maximum difference is 99895
Time taken by your method: 0.5533690452575684
Time taken by my method: 0.08005285263061523
My method is 6.912546237558299 times faster.
Similar to what @abarnert said (who always snipes me on these things I swear), you don't want to loop over the list twice. You can exploit the fact that you know that your larger value has to come after the smaller one. You also can exploit the fact that you don't care for anything except the largest number, that is, in the list [1,3,8,5,9], the maximum difference is 8 (9-1) and you don't care that 3, 8, and 5 are in there. Thus: max(a[i+1:]) - a[i] is the maximum difference for a given index.
Then you compare it with diff, and take the larger of the 2 with max, as calling default built-in python functions is somewhat faster than if curr_diff > diff: diff = curr_diff (or equivalent).
The return line is simply your (fixed) line in 1 line instead of 4.
As you can see, in a sample of 1000, this method is ~6x faster (note: used python 3.4, but nothing here would break on python 2.x)

I think the expected answer for
1, 2, 4, 2, 3, 8, 5, 6, 10
will be 8 - 2 = 6, but instead Saksham Varma's code will return 10 - 1 = 9.
It's max(arr) - min(arr).
Don't we have to reset the min value when there is a dip? I.e., 4 -> 2 will reset current_smallest = 2 and continue the diff calculation with the value '2'.
def f2(a):
    current_smallest = a[0]
    large_diff = 0
    for i in range(1, len(a)):
        # Identify the dip
        if a[i] < a[i-1]:
            current_smallest = a[i]
        if a[i] - current_smallest > large_diff:
            large_diff = a[i] - current_smallest
    return large_diff

Generate 4000 unique pseudo-random cartesian coordinates FASTER?

The range for x and y is from 0 to 99.
I am currently doing it like this:
excludeFromTrainingSet = []
while len(excludeFromTrainingSet) < 4000:
    tempX = random.randint(0, 99)
    tempY = random.randint(0, 99)
    if [tempX, tempY] not in excludeFromTrainingSet:
        excludeFromTrainingSet.append([tempX, tempY])
But it takes ages and I really need to speed this up.
Any ideas?

Vincent Savard has an answer that's almost twice as fast as the first solution offered here.
Here's my take on it. It requires tuples instead of lists for hashability:
def method2(size):
    ret = set()
    while len(ret) < size:
        ret.add((random.randint(0, 99), random.randint(0, 99)))
    return ret
Just make sure that the limit is sane as other answerers have pointed out. For sane input, this is better algorithmically O(n) as opposed to O(n^2) because of the set instead of list. Also, python is much more efficient about loading locals than globals so always put this stuff in a function.
EDIT: Actually, I'm not sure that they're O(n) and O(n^2) respectively because of the probabilistic component but the estimations are correct if n is taken as the number of unique elements that they see. They'll both be slower as they approach the total number of available spaces. If you want an amount of points which approaches the total number available, then you might be better off using:
import random
import itertools
def method2(size, min_, max_):
    range_ = range(min_, max_)
    points = itertools.product(range_, range_)
    return random.sample(list(points), size)
This will be a memory hog but is sure to be faster as the density of points increases because it avoids looking at the same point more than once. Another option worth profiling (probably better than last one) would be
def method3(size, min_, max_):
    range_ = range(min_, max_)
    points = list(itertools.product(range_, range_))
    N = (max_ - min_)**2
    L = N - size
    i = 1
    while i <= L:
        del points[random.randint(0, N - i)]
        i += 1
    return points
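On the probabilistic component mentioned in the EDIT above: the expected number of random draws the set-based loop needs can be estimated directly. When `found` unique points are already in the set, a fresh draw is new with probability (total - found) / total, so on average total / (total - found) draws are needed for the next one. The helper below (`expected_draws` is a made-up name) just sums that up; it is a back-of-the-envelope sketch assuming uniformly random, independent draws.

```python
def expected_draws(unique_needed, total):
    """Expected number of uniform random draws to see `unique_needed`
    distinct values out of `total` equally likely ones."""
    if unique_needed > total:
        raise ValueError("cannot collect more unique values than exist")
    return sum(total / (total - found) for found in range(unique_needed))

# 4000 unique points out of the 100 * 100 = 10000 possible ones.
print(expected_draws(4000, 10000))
# Collecting every possible point is far more expensive per unique point.
print(expected_draws(10000, 10000))
```

For 4000 out of 10000 this comes out at a little over 5000 draws, so the rejection loop wastes little work; it only becomes expensive as the requested count approaches the total, which is where the shuffle or sample based approaches pay off.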

My suggestion :
def method2(size):
    randints = range(0, 100)
    excludeFromTrainingSet = set()
    while len(excludeFromTrainingSet) < size:
        excludeFromTrainingSet.add((random.choice(randints), random.choice(randints)))
    return excludeFromTrainingSet
Instead of generating 2 random numbers every time, you first generate the list of numbers from 0 to 99, then you choose 2 of them and add the pair to the set. As others pointed out, there are only 10 000 possibilities so you can't loop until you get 40 000, but you get the point.

I'm sure someone is going to come in here with a usage of numpy, but how about using a set and tuple?
E.g.:
excludeFromTrainingSet = set()
while len(excludeFromTrainingSet) < 40000:
    temp = (random.randint(0, 99), random.randint(0, 99))
    if temp not in excludeFromTrainingSet:
        excludeFromTrainingSet.add(temp)
EDIT: Isn't this an infinite loop since there are only 100^2 = 10000 POSSIBLE results, and you're waiting until you get 40000?

Make a list of all possible (x,y) values:
allpairs = list((x,y) for x in xrange(100) for y in xrange(100))
# or with Py2.6 or later:
from itertools import product
allpairs = list(product(xrange(100),xrange(100)))
# or even taking DRY to the extreme
allpairs = list(product(*[xrange(100)]*2))
Shuffle the list:
from random import shuffle
shuffle(allpairs)
Read off the first 'n' values:
n = 4000
trainingset = allpairs[:n]
This runs pretty snappily on my laptop.
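On Python 3 the same "pick without replacement" idea can be written without building and shuffling the full list of pairs. The sketch below is just one way to do it, with a made-up function name `unique_points`: it samples distinct integers from range(side * side) and unfolds each one into an (x, y) pair with divmod.

```python
import random

def unique_points(count, side=100):
    """Return `count` distinct (x, y) pairs with 0 <= x, y < side."""
    if count > side * side:
        raise ValueError("only %d distinct points exist" % (side * side))
    # random.sample never repeats an element, so no collision check is needed.
    codes = random.sample(range(side * side), count)
    return [divmod(code, side) for code in codes]

trainingset = unique_points(4000)
print(len(trainingset), len(set(trainingset)))
```

Because every sampled integer maps to exactly one pair, uniqueness of the integers guarantees uniqueness of the coordinates, and asking for more points than exist fails immediately instead of looping forever.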

You could make a lookup table of random values... make a random index into that lookup table, and then step through it with a static increment counter...

Generating 40 thousand numbers inevitably will take a while. But you are performing an O(n) linear search on the excludeFromTrainingSet, which takes quite a while especially later in the process. Use a set instead. You could also consider generating a number of coordinate sets e.g. over night and pickle them, so you don't have to generate new data for each test run (dunno what you're doing, so this might or might not help). Using tuples, as someone noted, is not only the semantically correct choice, it might also help with performance (tuple creation is faster than list creation). Edit: Silly me, using tuples is required when using sets, since set members must be hashable and lists are unhashable.
But in your case, your loop isn't terminating because 0..99 is 100 numbers and two-tuples of them have only 100^2 = 10000 unique combinations. Fix that, then apply the above.

Taking Vince Savard's code:
>>> from random import choice
>>> def method2(size):
...     randints = range(0, 100)
...     excludeFromTrainingSet = set()
...     while True:
...         x = size - len(excludeFromTrainingSet)
...         if not x:
...             break
...         else:
...             excludeFromTrainingSet.update((choice(randints), choice(randints)) for _ in range(x))
...     return excludeFromTrainingSet
...
>>> s = method2(4000)
>>> len(s)
4000
This is not a great algorithm because it has to deal with collisions, but the tuple-generation makes it tolerable. This runs in about a second on my laptop.

## for py 3.0+
## generate 4000 points in 2D
##
import random
maxn = 10000
goodguys = 0
excluded = [0 for excl in range(0, maxn)]
for ntimes in range(0, maxn):
    alea = random.randint(0, maxn - 1)
    excluded[alea] += 1
    if(excluded[alea] > 1): continue
    goodguys += 1
    if goodguys > 4000: break
    two_num = divmod(alea, 100) ## Unfold the 2 numbers
    print(two_num)
