Generate 4000 unique pseudo-random cartesian coordinates FASTER? - python

The range for x and y is from 0 to 99.
I am currently doing it like this:
excludeFromTrainingSet = []
while len(excludeFromTrainingSet) < 4000:
    tempX = random.randint(0, 99)
    tempY = random.randint(0, 99)
    if [tempX, tempY] not in excludeFromTrainingSet:
        excludeFromTrainingSet.append([tempX, tempY])
But it takes ages and I really need to speed this up.
Any ideas?

Vincent Savard has an answer that's almost twice as fast as the first solution offered here.
Here's my take on it. It requires tuples instead of lists for hashability:
def method2(size):
    ret = set()
    while len(ret) < size:
        ret.add((random.randint(0, 99), random.randint(0, 99)))
    return ret
Just make sure that the limit is sane, as other answerers have pointed out. For sane input, this is algorithmically better: O(n) as opposed to O(n^2), because of the set instead of the list. Also, Python is much more efficient at loading locals than globals, so always put this stuff in a function.
EDIT: Actually, I'm not sure that they're O(n) and O(n^2) respectively because of the probabilistic component, but the estimates are correct if n is taken as the number of unique elements they see. Both will slow down as they approach the total number of available spaces. If you want a number of points that approaches the total number available, then you might be better off using:
import random
import itertools
def method2(size, min_, max_):
    range_ = range(min_, max_)
    points = itertools.product(range_, range_)
    return random.sample(list(points), size)
This will be a memory hog, but it is sure to be faster as the density of points increases, because it avoids looking at the same point more than once. Another option worth profiling (probably better than the last one) would be:
def method3(size, min_, max_):
    range_ = range(min_, max_)
    points = list(itertools.product(range_, range_))
    N = (max_ - min_)**2
    L = N - size
    i = 1
    while i <= L:
        del points[random.randint(0, N - i)]
        i += 1
    return points
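A quick way to profile the two might be a sketch like this with timeit (the call sizes are illustrative and it assumes method2 and method3 above are defined at module level):

import timeit

# Time each candidate 10 times on the 100x100 grid with 4000 requested points.
for name in ("method2", "method3"):
    elapsed = timeit.timeit("%s(4000, 0, 100)" % name, globals=globals(), number=10)
    print(name, elapsed)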

My suggestion:
def method2(size):
    randints = range(0, 100)
    excludeFromTrainingSet = set()
    while len(excludeFromTrainingSet) < size:
        excludeFromTrainingSet.add((random.choice(randints), random.choice(randints)))
    return excludeFromTrainingSet
Instead of generating 2 random numbers every time, you first generate the list of numbers from 0 to 99, then you choose 2 and add the tuple to the set. As others pointed out, there are only 10,000 possibilities, so you can't loop until you get 40,000, but you get the point.

I'm sure someone is going to come in here with a usage of numpy, but how about using a set and tuple?
E.g.:
excludeFromTrainingSet = set()
while len(excludeFromTrainingSet) < 40000:
    temp = (random.randint(0, 99), random.randint(0, 99))
    if temp not in excludeFromTrainingSet:
        excludeFromTrainingSet.add(temp)
EDIT: Isn't this an infinite loop since there are only 100^2 = 10000 POSSIBLE results, and you're waiting until you get 40000?

Make a list of all possible (x,y) values:
allpairs = list((x, y) for x in xrange(100) for y in xrange(100))
# or with Py2.6 or later:
from itertools import product
allpairs = list(product(xrange(100), xrange(100)))
# or even taking DRY to the extreme
allpairs = list(product(*[xrange(100)]*2))
Shuffle the list:
from random import shuffle
shuffle(allpairs)
Read off the first 'n' values:
n = 4000
trainingset = allpairs[:n]
This runs pretty snappily on my laptop.

You could make a lookup table of random values, pick a random index into that lookup table, and then step through it with a static increment counter.
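One way to read that suggestion (a rough sketch, not necessarily what the answerer had in mind): shuffle all 10,000 possible cells once up front, then walk the table from a random start with a fixed stride that is coprime to the table size, so no cell is visited twice.

import random

def lookup_table_sample(n=4000, side=100):
    # Pre-shuffled lookup table of every possible cell (built once, reusable).
    table = [(x, y) for x in range(side) for y in range(side)]
    random.shuffle(table)
    size = len(table)                  # 10000 cells
    start = random.randrange(size)     # random index into the table
    step = 101                         # static increment, coprime to 10000, so no cell repeats
    return [table[(start + i * step) % size] for i in range(n)]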

Generating 40 thousand numbers inevitably will take a while. But you are performing an O(n) linear search on excludeFromTrainingSet, which takes quite a while, especially later in the process. Use a set instead.

You could also consider generating a number of coordinate sets, e.g. overnight, and pickling them so you don't have to generate new data for each test run (I don't know what you're doing, so this might or might not help); a small sketch of that follows below.

Using tuples, as someone noted, is not only the semantically correct choice, it might also help with performance (tuple creation is faster than list creation). Edit: Silly me, using tuples is required when using sets, since set members must be hashable and lists are unhashable.
But in your case, your loop isn't terminating because 0..99 is 100 numbers and two-tuples of them have only 100^2 = 10000 unique combinations. Fix that, then apply the above.
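If the pickling suggestion is useful, a minimal sketch might look like this (the cache file name and helper name are made up for illustration):

import os
import pickle
import random

CACHE_FILE = "excluded_coords.pkl"   # illustrative cache file name

def load_or_generate(size=4000, side=100):
    # Reuse a previously pickled coordinate set if one exists on disk.
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, "rb") as f:
            return pickle.load(f)
    points = set()
    while len(points) < size:
        points.add((random.randint(0, side - 1), random.randint(0, side - 1)))
    with open(CACHE_FILE, "wb") as f:
        pickle.dump(points, f)
    return points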

Taking Vince Savard's code:
>>> from random import choice
>>> def method2(size):
...     randints = range(0, 100)
...     excludeFromTrainingSet = set()
...     while True:
...         x = size - len(excludeFromTrainingSet)
...         if not x:
...             break
...         else:
...             excludeFromTrainingSet.update((choice(randints), choice(randints)) for _ in range(x))
...     return excludeFromTrainingSet
...
>>> s = method2(4000)
>>> len(s)
4000
This is not a great algorithm because it has to deal with collisions, but the tuple-generation makes it tolerable. This runs in about a second on my laptop.

## for py 3.0+
## generate 4000 points in 2D
##
import random
maxn = 10000
goodguys = 0
excluded = [0 for excl in range(0, maxn)]
for ntimes in range(0, maxn):
    alea = random.randint(0, maxn - 1)
    excluded[alea] += 1
    if excluded[alea] > 1: continue
    goodguys += 1
    if goodguys > 4000: break
    two_num = divmod(alea, 100)  ## Unfold the 2 numbers
    print(two_num)
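The divmod trick above also suggests a compact variant (a sketch building on the same encoding, not part of the original answer): sample 4000 distinct codes from 0..9999 in one call and unfold each one into (x, y).

import random

# Every cell (x, y) with 0 <= x, y <= 99 corresponds to the integer 100*x + y,
# so sampling distinct codes avoids any collision handling.
codes = random.sample(range(100 * 100), 4000)
points = [divmod(code, 100) for code in codes]   # decode each code back into (x, y)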

Related

Guidance on removing a nested for loop from function

I'm trying to write the fastest algorithm possible to return the number of "magic triples" (i.e. x, y, z where z is a multiple of y and y is a multiple of x) in a list of 3-2000 integers.
(Note: I believe the list was expected to be sorted and unique but one of the test examples given was [1,1,1] with the expected result of 1 - that is a mistake in the challenge itself though because the definition of a magic triple was explicitly noted as x < y < z, which [1,1,1] isn't. In any case, I was trying to optimise an algorithm for sorted lists of unique integers.)
I haven't been able to work out a solution that doesn't include having three consecutive loops and therefore being O(n^3). I've seen one online that is O(n^2) but I can't get my head around what it's doing, so it doesn't feel right to submit it.
My code is:
def solution(l):
    if len(l) < 3:
        return 0
    elif l == [1,1,1]:
        return 1
    else:
        halfway = int(l[-1]/2)
        quarterway = int(halfway/2)
        quarterIndex = 0
        halfIndex = 0
        for i in range(len(l)):
            if l[i] >= quarterway:
                quarterIndex = i
                break
        for i in range(len(l)):
            if l[i] >= halfway:
                halfIndex = i
                break
        triples = 0
        for i in l[:quarterIndex+1]:
            for j in l[:halfIndex+1]:
                if j != i and j % i == 0:
                    multiple = 2
                    while (j * multiple) <= l[-1]:
                        if j * multiple in l:
                            triples += 1
                        multiple += 1
        return triples
I've spent quite a lot of time going through examples manually and removing loops through unnecessary sections of the lists, but this still completes a list of 2,000 integers in about a second, whereas the O(n^2) solution I found completes the same list in 0.6 seconds. It seems like a small difference, but it means mine takes about 60% longer.
Am I missing a really obvious way of removing one of the loops?
Also, I saw mention of making a directed graph and I see the promise in that. I can make the list of first nodes from the original list with a built-in function, so in principle I presume that means I can make the overall graph with two for loops and then return the length of the third node list, but I hit a wall with that too. I just can't seem to make progress without that third loop!!
from array import array
def num_triples(l):
    n = len(l)
    pairs = set()
    lower_counts = array("I", (0 for _ in range(n)))
    upper_counts = lower_counts[:]
    for i in range(n - 1):
        lower = l[i]
        for j in range(i + 1, n):
            upper = l[j]
            if upper % lower == 0:
                lower_counts[i] += 1
                upper_counts[j] += 1
    return sum(nx * nz for nz, nx in zip(lower_counts, upper_counts))
Here, lower_counts[i] is the number of pairs of which the ith number is the y, and z is the other number in the pair (i.e. the number of different z values for this y).
Similarly, upper_counts[i] is the number of pairs of which the ith number is the y, and x is the other number in the pair (i.e. the number of different x values for this y).
So the number of triples in which the ith number is the y value is just the product of those two numbers.
The use of an array here for storing the counts is for scalability of access time. Tests show that up to n=2000 it makes negligible difference in practice, and even up to n=20000 it only made about a 1% difference to the run time (compared to using a list), but it could in principle be the fastest growing term for very large n.
How about using itertools.combinations instead of nested for loops? Combined with list comprehension, it's cleaner and much faster. Let's say l = [your list of integers] and let's assume it's already sorted.
from itertools import combinations
def div(i, j, k):  # this function has the logic
    return l[k] % l[j] == l[j] % l[i] == 0

r = sum([div(i, j, k) for i, j, k in combinations(range(len(l)), 3) if i < j < k])
@alaniwi provided a very smart iterative solution.
Here is a recursive solution.
def find_magicals(lst, nplet):
    """Find the number of magical n-plets in a given lst"""
    res = 0
    for i, base in enumerate(lst):
        # find all the multiples of current base
        multiples = [num for num in lst[i + 1:] if not num % base]
        res += len(multiples) if nplet <= 2 else find_magicals(multiples, nplet - 1)
    return res

def solution(lst):
    return find_magicals(lst, 3)
The problem can be broken down as follows: take each number in the original list as the base (i.e. x), and count how many magical 2-plets (pairs) can be found among the numbers bigger than the base. Since the method for finding all 2-plets is the same as for finding 3-plets, we can solve the problem recursively.
From my testing, this recursive solution is comparable to, if not more performant than, the iterative solution.
This answer was the first suggestion by @alaniwi and is the one I've found to be the fastest (at 0.59 seconds for a 2,000 integer list).
def solution(l):
    n = len(l)
    lower_counts = dict((val, 0) for val in l)
    upper_counts = lower_counts.copy()
    for i in range(n - 1):
        lower = l[i]
        for j in range(i + 1, n):
            upper = l[j]
            if upper % lower == 0:
                lower_counts[lower] += 1
                upper_counts[upper] += 1
    return sum((lower_counts[y] * upper_counts[y] for y in l))
I think I've managed to get my head around it. What it is essentially doing is comparing each number in the list with every other number to see whether the larger is divisible by the smaller, and building two dictionaries:
One with, for each number, how many larger numbers in the list it divides (lower_counts),
One with, for each number, how many smaller numbers in the list divide it (upper_counts).
You then multiply the two values for each key, because a number with a 0 in either dictionary cannot be the middle number (y) of a triple.
Example:
l = [1,2,3,4,5,6]
lower_counts = {1:5, 2:2, 3:1, 4:0, 5:0, 6:0}
upper_counts = {1:0, 2:1, 3:1, 4:2, 5:1, 6:3}
triple_tuple = ([1,2,4], [1,2,6], [1,3,6])

Randomly sampling numerals from a list whose aggregate needs to be at least greater than a given benchmark

I have a list of tuples formed by 1000 object ids and their scores, i.e.:
scored_items = [('14',534.9),('4',86.0),('78',543.21),....].
Let T be the aggregated score of the top 20 highest scoring items.
That's easy. Using python:
top_20 = sorted(scored_items, key=lambda k: k[1], reverse=True)[:20]
T = sum(n for _, n in top_20)
Next, let t equal a quarter of T. I.e. in python: t = math.ceil(T/4)
My question is: what's the most efficient way to randomly select 20 items (without replacement) from scored_items such that their aggregated score is equal to or greater than (but never lower than) t? They may or may not include items from top_20.
Would prefer an answer in Python, and would prefer to not rely on external libraries much
Background: This is an item-ranking algorithm that is strategy proof according to an esoteric - but useful - Game Theory theorem. Source: section 2.5 in this paper, or just read footnote 18 on page 11 of this same link. Btw strategy proof essentially means it's tough to game it.
I'm a neophyte python programmer and have been mulling how to solve this problem for a while now, but just can't seem to wrap my head around it. Would be great to know how the experts would approach and solve this.
I suppose the most simplistic (and least performant perhaps) way is to keep randomly generating sets of 20 items till their scores' sum exceeds or equals t.
But there has to be a better way to do this right?
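For reference, the naive rejection approach described above might look like this as a rough sketch (the function name and the max_tries cap are illustrative additions):

import math
import random

def naive_sample(scored_items, n=20, max_tries=100000):
    # t: a quarter of the sum of the top-n scores, as defined above.
    top_n = sorted(scored_items, key=lambda k: k[1], reverse=True)[:n]
    t = math.ceil(sum(score for _, score in top_n) / 4)
    for _ in range(max_tries):
        candidate = random.sample(scored_items, n)        # n items without replacement
        if sum(score for _, score in candidate) >= t:
            return candidate
    return None   # give up instead of looping forever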
Here is an implementation of what I mentioned in the comments.
Since we want items such that the sum of the scores is large, we can weight the choice so that we are more likely to pick samples with large scores.
import numpy as np
import math

def normalize(p):
    return p / sum(p)

def get_sample(scored_items, N=20, max_iter=1000):
    topN = sorted(scored_items, key=lambda k: k[1], reverse=True)[:N]
    T = sum(n for _, n in topN)
    t = math.ceil(T / 4)
    i = 0
    scores = np.array([x[1] for x in scored_items])
    p = normalize(scores)
    while i < max_iter:
        sample_indexes = np.random.choice(a=range(len(scored_items)), size=N, replace=False, p=p)
        sample = [scored_items[x] for x in sample_indexes]
        if sum(n for _, n in sample) >= t:
            print("Found a solution at iteration %d" % i)
            return sample
        i += 1
    print("Could not find a solution after %d iterations" % max_iter)
    return None
An example of how to use it:
np.random.seed(0)
ids = range(1000)
scores = 10000*np.random.random_sample(size=len(ids))
scored_items = list(zip(map(str, ids), scores))
sample = get_sample(scored_items, 20)
#Found a solution at iteration 0
print(sum(n for _, n in sample))
#139727.1229832652
Though this is not guaranteed to get a solution, I ran this in a loop 100 times and each time a distinct solution was found on the first iteration.
Though I do not know of an efficient way for huge lists, something like this works even for 1,000 or so items. You can do a bit better if you don't need true randomness.
import random

testList = [x for x in range(1, 1000)]
t = sum(range(975, 1000)) / 4
while True:
    rs = random.sample(testList, 15)
    if sum(rs) >= t:
        break
print(rs)

Constraining random number generation in Python

I am trying to create a loop in Python with numpy that will give me a variable "times" with 5 numbers generated randomly between 0 and 20. However, I want there to be one condition: that none of the differences between two adjacent elements in that list are less than 1. What is the best way to achieve this? I tried with the last two lines of code, but this is most likely wrong.
for j in range(1,6):
    times = np.random.rand(1, 5) * 20
    times.sort()
    print times
    da = np.diff(times)
    if da.sum < 1: break
For instance, for one iteration, this would not be good:
4.25230915 4.36463992 10.35915732 12.39446368 18.46893283
But something like this would be perfect:
1.47166904 6.85610453 10.81431629 12.10176092 15.53569052
Since you are using numpy, you might as well use the built-in functions for uniform random numbers.
import numpy as np

def uniform_min_range(a, b, n, min_dist):
    while True:
        x = np.sort(np.random.uniform(a, b, size=n))  # np.sort returns a sorted copy
        if np.all(np.diff(x) >= min_dist):
            return x
It uses the same trial-and-error approach as the previous answer, so depending on the parameters the time to find a solution can be large.
Use a hit and miss approach to guarantee uniform distribution. Here is a straight-Python implementation which should be tweakable for numpy:
import random
def randSpacedPoints(n,a,b,minDist):
    # draws n random numbers in [a,b]
    # with property that their distance apart is >= minDist
    # uses a hit-miss approach
    while True:
        nums = [a + (b-a)*random.random() for i in range(n)]
        nums.sort()
        if all(nums[i] + minDist < nums[i+1] for i in range(n-1)):
            return nums
For example,
>>> randSpacedPoints(5,0,20,1)
[0.6681336968970486, 6.882374558960349, 9.73325447748434, 11.774594560239493, 16.009157676493903]
If there is no feasible solution this will hang in an infinite loop (so you might want to add a safety parameter which controls the number of trials).
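For example, a capped variant might look like this (the max_tries parameter is an addition for illustration, not part of the original answer):

import random

def randSpacedPointsCapped(n, a, b, minDist, max_tries=100000):
    # Same hit-and-miss idea, but gives up after max_tries attempts
    # instead of hanging when no feasible spacing exists.
    for _ in range(max_tries):
        nums = sorted(a + (b - a) * random.random() for _ in range(n))
        if all(nums[i] + minDist < nums[i + 1] for i in range(n - 1)):
            return nums
    raise ValueError("no arrangement found within max_tries attempts")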

Pick N distinct items at random from sequence of unknown length, in only one iteration

I am trying to write an algorithm that would pick N distinct items from a sequence at random, without knowing the size of the sequence in advance, and where it is expensive to iterate over the sequence more than once. For example, the elements of the sequence might be the lines of a huge file.
I have found a solution when N=1 (that is, "pick exactly one element at random from a huge sequence"):
import random
items = range(1, 10) # Imagine this is a huge sequence of unknown length
count = 1
selected = None
for item in items:
    if random.random() * count < 1:
        selected = item
    count += 1
But how can I achieve the same thing for other values of N (say, N=3)?
If your sequence is short enough that reading it into memory and randomly sorting it is acceptable, then a straightforward approach would be to just use random.shuffle:
import random
arr=[1,2,3,4]
# In-place shuffle
random.shuffle(arr)
# Take the first 2 elements of the now randomized array
print arr[0:2]
[1, 3]
Depending upon the type of your sequence, you may need to convert it to a list by calling list(your_sequence) on it, but this will work regardless of the types of the objects in your sequence.
Naturally, if you can't fit your sequence into memory or the memory or CPU requirements of this approach are too high for you, you will need to use a different solution.
Use reservoir sampling. It's a very simple algorithm that works for any N.
Here is one Python implementation, and here is another.
Simplest I've found is this answer in SO, improved a bit below:
import random
my_list = [1, 2, 3, 4, 5]
how_big = 2
new_list = random.sample(my_list, how_big)
# To preserve the order of the list, you could do:
randIndex = random.sample(range(len(my_list)), how_big)
randIndex.sort()
new_list = [my_list[i] for i in randIndex]
If you have Python 3.6+, you can use choices (note that it samples with replacement, so repeats are possible):
from random import choices
items = range(1, 10)
new_items = choices(items, k = 3)
print(new_items)
[6, 3, 1]
@NPE is correct, but the implementations that are being linked to are sub-optimal and not very "pythonic". Here's a better implementation:
import random

def sample(iterator, k):
    """
    Samples k elements from an iterable object.

    :param iterator: an object that is iterable
    :param k: the number of items to sample
    """
    iterator = iter(iterator)  # ensure we have an iterator, so the loop below continues where next() left off
    # fill the reservoir to start
    result = [next(iterator) for _ in range(k)]
    n = k - 1
    for item in iterator:
        n += 1
        s = random.randint(0, n)
        if s < k:
            result[s] = item
    return result
Edit: As @panda-34 pointed out, the original version was flawed, but not because I was using randint vs randrange. The issue is that my initial value for n didn't account for the fact that randint is inclusive on both ends of the range. Taking this into account fixes the issue. (Note: you could also use randrange since it's inclusive on the minimum value and exclusive on the maximum value.)
The following will give you N random items from an array X:
import random
list(map(lambda _: random.choice(X), range(N)))
It should be enough to accept or reject each new item just once, and, if you accept it, throw out a randomly chosen old item.
Suppose you have already selected N items out of the K seen so far, at random, and now you see the (K+1)th item. Accept it with probability N/(K+1) and its probability is OK. The current items got in with probability N/K, and get thrown out with probability (N/(K+1))(1/N) = 1/(K+1), so they survive with probability (N/K)(K/(K+1)) = N/(K+1), so their probabilities are OK too.
And yes I see somebody has pointed you to reservoir sampling - this is one explanation of how that works.
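A rough sketch of that accept-and-evict rule (names are illustrative; it is equivalent to reservoir sampling):

import random

def sample_n(items, n):
    # Keep the first n items; accept the (k+1)th item with probability n/(k+1)
    # and let it evict a uniformly chosen current member.
    selected = []
    for k, item in enumerate(items):
        if k < n:
            selected.append(item)
        elif random.random() * (k + 1) < n:       # accept with probability n/(k+1)
            selected[random.randrange(n)] = item  # throw out a random old item
    return selected

print(sample_n(range(1, 1000000), 3))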
As aix mentioned, reservoir sampling works. Another option is to generate a random number for every number you see and select the k numbers with the largest random keys.
To do it iteratively, maintain a heap of k (random number, number) pairs, and whenever you see a new number, insert it into the heap if its random number is greater than the smallest key in the heap.
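A rough sketch of that heap variant using heapq (names are illustrative):

import heapq
import random

def sample_k(iterable, k):
    # Tag each item with a random key and keep the k largest keys in a min-heap.
    heap = []   # holds (random_key, item) pairs; the smallest key sits at heap[0]
    for item in iterable:
        key = random.random()
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))   # evict the pair with the smallest key
    return [item for _, item in heap]

print(sample_k(range(1, 1000000), 3))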
This was my answer to a duplicate question (closed before I could post) that was somewhat related ("generating random numbers without any duplicates"). Since, it is a different approach than the other answers, I'll leave it here in case it provides additional insight.
from random import randint
random_nums = []
N = # whatever number of random numbers you want
r = # lower bound of number range
R = # upper bound of number range
x = 0
while x < N:
    random_num = randint(r, R)  # inclusive range
    if random_num in random_nums:
        continue
    else:
        random_nums.append(random_num)
        x += 1
The reason for the while loop over a for loop is that it guarantees you still end up with N numbers when duplicates are drawn (i.e. if you hit 3 duplicates, you don't end up with only N-3 numbers).
There's one implementation from the numpy library.
Assuming that N is smaller than the length of the array, you'd have to do the following:
# my_array is the array to be sampled from
assert N <= len(my_array)
indices = np.random.permutation(N) # Generates shuffled indices from 0 to N-1
sampled_array = my_array[indices]
If you need to sample the whole array and not just the first N positions, then you can use:
import random
sampled_array = my_array[random.sample(range(len(my_array)), N)]

Quicksort sorts larger numbers faster?

I was messing around with Python trying to practice my sorting algorithms and found out something interesting.
I have three different pieces of data:
x = number of numbers to sort
y = range the numbers are in (all random generated ints)
z = total time taken to sort
When:
x = 100000 and
y = (0,100000) then
z = 0.94182094911 sec
When:
x = 100000 and
y = (0,100) then
z = 12.4218382537 sec
When:
x = 100000 and
y = (0,10) then
z = 110.267447809 sec
Any ideas?
Code:
import time
import random
import sys
#-----Function definitions
def quickSort(array):  # random pivot location quicksort. uses extra memory.
    smaller = []
    greater = []
    if len(array) <= 1:
        return array
    pivotVal = array[random.randint(0, len(array)-1)]
    array.remove(pivotVal)
    for items in array:
        if items <= pivotVal:
            smaller.append(items)
        else:
            greater.append(items)
    return concat(quickSort(smaller), pivotVal, quickSort(greater))

def concat(before, pivot, after):
    new = []
    for items in before:
        new.append(items)
    new.append(pivot)
    for things in after:
        new.append(things)
    return new
#-----Variable definitions
list = []
iter = 0
sys.setrecursionlimit(20000)
start = time.clock() #start the clock
#-----Generate the list of numbers to sort
while(iter < 100000):
    list.append(random.randint(0,10)) #modify this to change sorting speed
    iter = iter + 1
timetogenerate = time.clock() - start #current timer - last timer snapshot
#-----Sort the list of numbers
list = quickSort(list)
timetosort = time.clock() - timetogenerate #current timer - last timer snapshot
#-----Write the list of numbers
file = open("C:\output.txt", 'w')
for items in list:
    file.write(str(items))
    file.write("\n")
file.close()
timetowrite = time.clock() - timetosort #current timer - last timer snapshot
#-----Print info
print "time to start: " + str(start)
print "time to generate: " + str(timetogenerate)
print "time to sort: " + str(timetosort)
print "time to write: " + str(timetowrite)
totaltime = timetogenerate + timetosort + start
print "total time: " + str(totaltime)
-------------------revised NEW code----------------------------
def quickSort(array):  # random pivot location quicksort. uses extra memory.
    smaller = []
    greater = []
    equal = []
    if len(array) <= 1:
        return array
    pivotVal = array[random.randint(0, len(array)-1)]
    array.remove(pivotVal)
    equal.append(pivotVal)
    for items in array:
        if items < pivotVal:
            smaller.append(items)
        elif items > pivotVal:
            greater.append(items)
        else:
            equal.append(items)
    return concat(quickSort(smaller), equal, quickSort(greater))

def concat(before, equal, after):
    new = []
    for items in before:
        new.append(items)
    for items in equal:
        new.append(items)
    for items in after:
        new.append(items)
    return new
I think this has to do with the choice of a pivot. Depending on how your partition step works, if you have a lot of duplicate values, your algorithm can degenerate to quadratic behavior when confronted with many duplicates. For example, suppose that you're trying to quicksort this stream:
[0 0 0 0 0 0 0 0 0 0 0 0 0]
If you aren't careful with how you do the partitioning step, this can degenerate quickly. For example, suppose you pick your pivot as the first 0, leaving you with the array
[0 0 0 0 0 0 0 0 0 0 0 0]
to partition. Your algorithm might say that the smaller values are the array
[0 0 0 0 0 0 0 0 0 0 0 0]
And the larger values are the array
[]
This is the case that causes quicksort to degenerate to O(n^2), since each recursive call is only shrinking the size of the input by one (namely, by pulling off the pivot element).
I noticed that in your code, your partitioning step does indeed do this:
for items in array:
    if items <= pivotVal:
        smaller.append(items)
    else:
        greater.append(items)
Given a stream that's a whole bunch of copies of the same element, this will put all of them into one array to recursively sort.
Of course, this seems like a ridiculous case - how is this at all connected to reducing the number of values in the array? - but it actually does come up when you're sorting lots of elements that aren't distinct. In particular, after a few passes of the partitioning, you're likely to group together all equal elements, which will bring you into this case.
For a discussion of how to prevent this from happening, there's a really great talk by Bob Sedgewick and Jon Bentley about how to modify the partition step to work quickly when in the presence of duplicate elements. It's connected to Dijkstra's Dutch national flag problem, and their solutions are really clever.
One option that works is to partition the input into three groups - less, equal, and greater. Once you've broken the input up this way, you only need to sort the less and greater groups; the equal groups are already sorted. The above link to the talk shows how to do this more or less in-place, but since you're already using an out-of-place quicksort the fix should be easy. Here's my attempt at it:
for items in array:
    if items < pivotVal:
        smaller.append(items)
    elif items == pivotVal:
        equal.append(items)
    else:
        greater.append(items)
I've never written a line of Python in my life, BTW, so this may be totally illegal syntax. But I hope the idea is clear! :-)
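For completeness, here is a rough sketch of what an in-place three-way (Dutch national flag) partition can look like; it is not taken from the talk, just an illustration of the same idea:

import random

def partition3(array, lo, hi):
    # In-place three-way partition of array[lo..hi] around a random pivot.
    # Afterwards: array[lo:lt] < pivot, array[lt:gt+1] == pivot, array[gt+1:hi+1] > pivot.
    pivot = array[random.randint(lo, hi)]
    lt, i, gt = lo, lo, hi
    while i <= gt:
        if array[i] < pivot:
            array[lt], array[i] = array[i], array[lt]
            lt += 1
            i += 1
        elif array[i] > pivot:
            array[i], array[gt] = array[gt], array[i]
            gt -= 1
        else:
            i += 1
    return lt, gt

def quicksort3(array, lo=0, hi=None):
    # Only the < and > regions need further sorting; the == region is already in place.
    if hi is None:
        hi = len(array) - 1
    if lo >= hi:
        return
    lt, gt = partition3(array, lo, hi)
    quicksort3(array, lo, lt - 1)
    quicksort3(array, gt + 1, hi)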
Things we know:
Time complexity for quick sort of unordered array is O(n*logn).
If the array is already sorted, it degrades to O(n^2).
The first two statements are not black and white: the closer an array is to being sorted, the closer quicksort's running time gets to O(n^2); conversely, the more shuffled it is, the closer it gets to O(n*logn).
Now, let's look at your experiment:
In all three cases you used the same number of elements. So, our n which you named x is always 100000.
In your first experiment, you used numbers between 0 and 100000, so ideally with a perfect random number generator you'd get mostly different numbers in a relatively unordered list, thus fitting the O(n*logn) complexity case.
In your third experiment, you used numbers between 0 and 10 in a 100000-element list. That means there were a great many duplicates in your list, making it a lot closer to a sorted list than in the first experiment. So, in that case, the time complexity was much closer to O(n^2).
And for the same, large enough n, n^2 > n*logn, which your experiment actually confirmed.
The quicksort algorithm has a known weakness: it is slower when the data is mostly sorted. When you have 100000 numbers between 0 and 10, they will be closer to being 'mostly sorted' than 100000 numbers in the range 0 to 100000.
