How do I make a list of random unique tuples?

I've looked over several answers to similar questions, and they all offer good one-liners, but those only make a list unique by removing duplicates. I need the list to contain exactly 5 unique tuples.
The only code I could come up with is as such:
from random import *
tuples = []
while len(tuples) < 5:
    rand = (randint(0, 6), randint(0,6))
    if rand not in tuples:
        tuples.append(rand)
I feel like there is a simpler way but I can't figure it out. I tried playing with sample() from random:
sample((randint(0,6), randint(0,6)), 5)
But this gives me a "Sample larger than population or is negative" error.

One quick way is to use itertools.product to generate all tuple possibilities before using sample to choose 5 from them:
from itertools import product
from random import sample
sample(list(product(range(7), repeat=2)), k=5)

For such a small set of inputs, just generate all possible outputs, and sample them:
import itertools
import random
size = 6
random.sample(list(itertools.product(range(size+1), repeat=2)), 5)
You indicate that the bounds (size) may be a parameter though, and if the bounds might be even a little larger, this could be a problem (you'd be generating size ** 2 tuples to select 5 of them, and the memory usage could get out of control). If that's a problem, given you only need a pair of integers, there is a cheap trick: Choose one random integer that encodes both resulting integers, then decode it. For example:
size = 6
raw_sample = random.sample(range((size + 1) ** 2), 5)
decoded_sample = [divmod(x, size+1) for x in raw_sample]
Since range is zero overhead (the memory usage doesn't depend on the length), you can select precisely five values from it with overhead proportionate to the five selected, not the 49 possible results. You then compute the quotient and remainder based on the range of a single value (0 to size inclusive in this case, so size + 1 possible values), and that gets the high and low results cheaply.
The performance differences are stark; comparing:
def unique_random_pairs_by_product(size):
    return random.sample(list(itertools.product(range(size+1), repeat=2)), 5)
to:
def unique_random_pairs_optimized(size):
    val_range = size + 1
    return [divmod(x, val_range) for x in random.sample(range(val_range * val_range), 5)]
the optimized version takes about 15% less time even for an argument of 6 (~4.65 μs for product, ~3.95 μs for optimized). But at size of 6, you're not seeing the scaling factor at all. For size=100, optimized only increases to ~4.35 μs (the time increasing slightly because the larger range is more likely to have to allocate new ints, instead of using the small int cache), while product jumps to 387 μs, a nearly 100x difference. And for size=1000, the time for product jumps to 63.8 ms, while optimized remains ~4.35 μs; a factor of 10,000x difference in runtime (and an even higher multiplier on memory usage). If size gets any larger than that, the product-based solution will quickly reach the point where the delay from even a single sampling is noticeable to humans; the optimized solution will continue to run with identical performance (modulo incredibly tiny differences in the cost of the divmod).
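These timings depend on hardware and Python version, so they are worth re-measuring locally. A minimal sketch using the standard timeit module follows; the helper name time_call and the repeat counts are illustrative assumptions, not part of the original answer:

```python
import itertools
import random
import timeit


def unique_random_pairs_by_product(size):
    # Materializes all (size + 1) ** 2 pairs before sampling 5 of them.
    return random.sample(list(itertools.product(range(size + 1), repeat=2)), 5)


def unique_random_pairs_optimized(size):
    # Samples 5 distinct integers, then decodes each into a (high, low) pair.
    val_range = size + 1
    return [divmod(x, val_range) for x in random.sample(range(val_range * val_range), 5)]


def time_call(func, size, number=200):
    # Hypothetical helper: best-of-3 average time per call, in microseconds.
    best = min(timeit.repeat(lambda: func(size), number=number, repeat=3))
    return best / number * 1e6


if __name__ == "__main__":
    for size in (6, 100):
        for func in (unique_random_pairs_by_product, unique_random_pairs_optimized):
            print(f"size={size:4d} {func.__name__}: {time_call(func, size):.2f} us")
```

Absolute numbers will differ from the ones quoted above, but the gap between the two functions should widen in the same way as size grows.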

Related

Itertools combinations: how to make it faster?

I am coding this program that takes 54 (num1) numbers and puts them in a list. It then takes 16 (num2) of those numbers and forms a list that contains lists of 16 numbers chosen from all the combinations possible of "num1"c"num2". It then takes those lists and generates 4x4 arrays.
The code I have works, but running 54 numbers to get all the arrays I want will take a long time. I know this because I have tested the code using from 20 up to 40 numbers and timed it.
20 numbers = 0.000055 minutes
30 numbers = 0.045088 minutes
40 numbers = 17.46944 minutes
Using all the 20 points of test data I got, I built a math model to predict how long it would take to run the 54 numbers, and I am getting 1740 minutes = 29 hours. This is already an improvement from a v1 of this code that was predicting 38 hours and from a v0 that was actually crashing my machine.
I am reaching out to you to try and make this run even faster. The program is not even RAM intensive. I have 8GB of RAM and a core i7 processor, it does not slow down my machine at all. It actually runs very smooth compared to previous versions I had where my computer even crashed a few times.
Do you guys think there is a way? I am currently sampling to reduce processing time but I would prefer not to sample it at all, if that's possible. I am not even printing the arrays also to reduce processing time, I am just printing a counter to see how many combinations I generated.
This is the code:
import numpy as np
import itertools
from itertools import combinations
from itertools import islice
from random import sample
num1 = 30 #ideally this is 54, just using 30 now so you can test it.
num2 = 16
steps = 1454226 #represents 1% of "num1"c"num2" just to reduce processing time for testing.
nums=list()
for i in range(1,num1+1):
    nums.append(i)
#print ("nums: ", nums) #Just to ensure that I am indeed using numbers from 1 to num1
vun=list()
tabl=list()
counter = 0
combin = islice(itertools.combinations(nums, num2),0,None,steps)
for i in set(combin):
    vun.append(sample(i,num2))
    counter = counter + 1
    p1=i[0];p2=i[1];p3=i[2];p4=i[3];p5=i[4];p6=i[5];p7=i[6];p8=i[7];p9=i[8]
    p10=i[9];p11=i[10];p12=i[11];p13=i[12];p14=i[13];p15=i[14];p16=i[15]
    tun = np.array ([(p1,p2,p3,p4),(p5,p6,p7,p8),(p9,p10,p11,p12),(p13,p14,p15,p16)])
    tabl.append(tun)
# print ("TABL:" ,tabl)
# print ("vun: ", vun)
print ("combinations:",counter)
The output I get with this code is:
combinations: 101
Ideally, this number would be 2.109492366 × 10¹³, or at least 1% of it, as long as it runs the 54x16 case and does not take 29 hours.

The main inefficiency comes from generating all the combinations (itertools.combinations(nums, num2)), only to throw most of them away.
Another approach would be to generate combinations at random, ensuring there are no duplicates.
import itertools
import random
def random_combination(iterable, r):
    "Random selection from itertools.combinations(iterable, r)"
    pool = tuple(iterable)
    n = len(pool)
    indices = sorted(random.sample(range(n), r))
    return tuple(pool[i] for i in indices)
items = list(range(1, 55))
samples = set()
while len(samples) < 100_000:
    sample = random_combination(items, 16)
    samples.add(sample)
for sample in samples:
    board = list(sample)
    random.shuffle(board)
    board = [board[0:4], board[4: 8], board[8: 12], board[12: 16]]
print("done")
This uses the random_combination function from answers to this question, which in turn comes from the itertools documentation.
The code generates 100,000 unique 4x4 samples in about ten seconds, at least on my machine.
A few notes:
Each sample is a tuple and the entries are sorted; this means we can store them in a set and avoid duplicates.
Because of the first point, we shuffle each sample before creating a 4x4 board from it; the later code doesn't do anything with these boards, but I wanted to include them to get a sense of the timing.
It's possible that there would be lots of hash collisions if you were to sample a large proportion of the space, but that's not feasible anyway because of the amount of data that would involve (see below).
I think there's been some confusion about what you are trying to achieve here.
54C16 = 2.1 x 10^13 ... to store 16 8-bit integers for all of these points would take 2.7 x 10^15 bits, which is 337.5 terabytes. That's beyond what could be stored on a local disk.
So, to cover even 1% of the space would take over 3TB ... maybe possible to store on disk at a push. You hinted in the question that you'd like to cover this proportion of the space. Clearly that's not going to happen in 8GB of RAM.
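The storage estimate can be reproduced in a few lines. This is a sketch of the arithmetic only, assuming one byte (8 bits) per number and decimal terabytes (1 TB = 10**12 bytes):

```python
import math

n_combinations = math.comb(54, 16)      # number of 16-element subsets of 54 items
bytes_per_combination = 16              # assumption: one byte (8 bits) per number
total_bytes = n_combinations * bytes_per_combination

print(n_combinations)                   # 21094923659355
print(total_bytes / 1e12)               # about 337.5 TB for the full space
print(0.01 * total_bytes / 1e12)        # about 3.4 TB for 1% of the space
```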

Just calculating the number of combinations is trivial, since it's just a formula:
import math
math.comb(30, 16)
# 145422675
math.comb(54, 16)
# 21094923659355
The trouble is that storing the results of the 16 of 30 case requires about 64 GB of RAM on my machine. You might but probably don't have that much RAM just sitting around like I do. The 16 of 54 case requires about 9.3 PB of RAM, which no modern architecture supports.
You're going to need to take one of two approaches:
1. Limit to the 16 in 30 case, and don't store any results into vun or tabl.
   Pros: Can be made to work in < 5 minutes in my testing.
   Cons: Doesn't work for the 16 in 54 case at all, no additional processing is practical.
2. Do a Monte Carlo simulation instead: generate randomized combinations up to some large but reachable sample count and do your math on those.
   Pros: Fast and supports both 16 of 30, and 16 of 54 potentially with the same time performance.
   Cons: Results will have some random variation depending on random seed, should be statistically treated to get confidence intervals for validity.
   Note: The formulas used for confidence intervals depend on which actual math you intend to do with these numbers, the average is a good place to start though if you're only looking for an estimate.
I strongly suggest option (2), the Monte Carlo simulation.
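A minimal sketch of what such a Monte Carlo estimate could look like. The statistic used here (the sum of the 16 chosen numbers) is only a placeholder assumption, since the question does not say what is computed from each board, and the function name monte_carlo_mean is hypothetical. The interval uses the normal approximation for a sample mean:

```python
import math
import random
import statistics


def monte_carlo_mean(n_items, r, n_samples, statistic, seed=None):
    # Draw n_samples random r-element combinations of 1..n_items and
    # estimate the mean of statistic(combination) with a 95% interval.
    rng = random.Random(seed)
    items = range(1, n_items + 1)
    values = [statistic(rng.sample(items, r)) for _ in range(n_samples)]
    mean = statistics.fmean(values)
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(n_samples)
    return mean, (mean - half_width, mean + half_width)


mean, (low, high) = monte_carlo_mean(54, 16, 10_000, sum, seed=1)
print(f"estimated mean sum: {mean:.1f} (95% CI {low:.1f} to {high:.1f})")
```

For this placeholder statistic the exact answer is known (16 × 27.5 = 440), which makes it a convenient sanity check. Repeated combinations are not filtered out here, because independent draws are what the confidence interval assumes.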

How to generate unique(!) arrays/lists/sequences of uniformly distributed random

Let's say I generate a pack, i.e., a one-dimensional array of 10 random numbers, with a random generator. Then I generate another array of 10 random numbers. I do this X times. How can I generate unique arrays so that, even after a trillion generations, no array is equal to another?
Within one array, the elements can be duplicates. An array only has to differ from every other array in at least one element.
Is there any numpy method for this? Is there some special algorithm which works differently by exploring some space for the random generation? I don’t know.
One easy answer would be to write the arrays to a file and check whether each new one was generated already, but the I/O operations on an ever-growing file take far too much time.

This is a difficult request, since one of the properties of a RNG is that it should repeat sequences randomly.
You also have the problem of trying to record terabytes of prior results. One thing you could try is to form a hash table (for search speed) of the existing arrays. Using this depends heavily on whether you have sufficient RAM to hold the entire list.
If not, you might consider disk-mapping a fast search structure of some sort. For instance, you could implement an on-disk binary tree of hash keys, re-balancing whenever you double the size of the tree (with insertions). This lets you keep the file open and find entries via seek, rather than needing to represent the full file in memory.
You could also maintain an in-memory index to the table, using that to drive your seek to the proper file section, then reading only a small subset of the file for the final search.
Does that help focus your implementation?
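As a rough illustration of the in-memory hash table idea, the sketch below stores a fixed-size digest of each pack instead of the pack itself. It assumes packs are sequences of Python floats; the names pack_digest and SeenPacks are hypothetical, and BLAKE2b with a 16-byte digest is just one possible choice of hash:

```python
import hashlib
import random
import struct


def pack_digest(pack):
    # Serialize the floats as 8-byte doubles, then hash to a 16-byte key.
    raw = struct.pack(f"{len(pack)}d", *pack)
    return hashlib.blake2b(raw, digest_size=16).digest()


class SeenPacks:
    """Remembers digests of packs that were already generated."""

    def __init__(self):
        self._digests = set()

    def add_if_new(self, pack):
        # Returns True and records the pack if it has not been seen before.
        digest = pack_digest(pack)
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True


seen = SeenPacks()
pack = [random.random() for _ in range(10)]
print(seen.add_if_new(pack))  # True: first time this pack is seen
print(seen.add_if_new(pack))  # False: an exact repeat is rejected
```

Two different packs sharing a 128-bit digest is extremely unlikely, but if it happened the second pack would be rejected as a false duplicate. Memory still grows with every stored digest, so the RAM limit described above applies to this sketch as well.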

Assume that the 10 numbers in a pack are each in the range [0..max]. Each pack can then be considered as a 10 digit number in base max+1. Obviously, the size of max determines how many unique packs there are. For example, if max=9 there are 10,000,000,000 possible unique packs from [0000000000] to [9999999999].
The problem then comes down to generating unique numbers in the correct range.
Given your "trillions" then the best way to generate guaranteed unique numbers in the range is probably to use an encryption with the correct size output. Unless you want 64 bit (DES) or 128 bit (AES) output then you will need some sort of format preserving encryption to get output in the range you want.
For input, just encrypt the numbers 0, 1, 2, ... in turn. Encryption guarantees that, given the same key, the output is unique for each unique input. You just need to keep track of how far you have got with the input numbers. Given that, you can generate more unique packs as needed, within the limit imposed by max. After that point the output will start repeating.
Obviously as a final step you need to convert the encryption output to a 10 digit base max+1 number and put it into an array.
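To make the idea concrete, here is a toy sketch, not a vetted format-preserving encryption scheme. It swaps in a small hash-based Feistel network with "cycle walking" as a stand-in for real encryption: the network is a keyed bijection on a power-of-two range, and re-applying it until the result falls below n turns it into a bijection on [0, n). The function names, the use of SHA-256 as round function and the round count are all assumptions made for illustration, and n is assumed to be at most 2**128:

```python
import hashlib


def _round(value, key, rnd, half_bits):
    # Keyed round function: hash (key, round number, half-block) down to half_bits bits.
    data = key + bytes([rnd]) + value.to_bytes(16, "big")
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << half_bits) - 1)


def permute(index, n, key, rounds=4):
    # Map index in [0, n) to a unique value in [0, n) (a keyed bijection).
    half_bits = max(1, ((n - 1).bit_length() + 1) // 2)
    mask = (1 << half_bits) - 1
    x = index
    while True:
        left, right = x >> half_bits, x & mask
        for rnd in range(rounds):
            left, right = right, left ^ _round(right, key, rnd, half_bits)
        x = (left << half_bits) | right
        if x < n:  # cycle walking: retry until we land inside the domain
            return x


def pack_from_index(index, max_value, length, key):
    # Convert the permuted counter into `length` digits in base max_value + 1.
    base = max_value + 1
    code = permute(index, base ** length, key)
    digits = []
    for _ in range(length):
        code, digit = divmod(code, base)
        digits.append(digit)
    return digits[::-1]


key = b"demo key"
for i in range(3):
    print(pack_from_index(i, max_value=9, length=10, key=key))
```

Only the counter i needs to be remembered between calls, which is what removes the need to store previously generated packs.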

Important caveat:
This will not allow you to generate "arbitrarily" many unique packs. Please see limits as highlighted by @Prune.
Note that as the number of requested packs approaches the number of unique packs this takes longer and longer to find a pack. I also put in a safety so that after a certain number of tries it just gives up.
Feel free to adjust:
import random
## -----------------------
## Build a unique pack generator
## -----------------------
def build_pack_generator(pack_length, min_value, max_value, max_attempts):
    existing_packs = set()
    def _generator():
        pack = tuple(random.randint(min_value, max_value) for _ in range(1, pack_length +1))
        pack_hash = hash(pack)
        attempts = 1
        while pack_hash in existing_packs:
            if attempts >= max_attempts:
                raise KeyError("Unable to find a valid pack")
            pack = tuple(random.randint(min_value, max_value) for _ in range(1, pack_length +1))
            pack_hash = hash(pack)
            attempts += 1
        existing_packs.add(pack_hash)
        return list(pack)
    return _generator
generate_unique_pack = build_pack_generator(2, 1, 9, 1000)
## -----------------------
for _ in range(50):
    print(generate_unique_pack())

The Birthday problem suggests that at some point you don't need to bother checking for duplicates. For example, if each value in a 10 element "pack" can take on more than ~250 values then you only have a 50% chance of seeing a duplicate after generating 1e12 packs. The more distinct values each element can take on the lower this probability.
You've not specified what these random values are in this question (other than being uniformly distributed) but your linked question suggests they are Python floats. Hence each number has 2**53 distinct values it can take on, and the resulting probability of seeing a duplicate is practically zero.
There are a few ways of rearranging this calculation:
1. for a given amount of state and number of iterations, what's the probability of seeing at least one collision
2. for a given amount of state, how many iterations can you generate to stay below a given probability of seeing at least one collision
3. for a given number of iterations and probability of seeing a collision, what state size is required
The below Python code calculates option 3 as it seems closest to your question. The other options are available on the birthday attack page.
from math import log2, log1p
def birthday_state_size(size, p):
    # -log1p(-p) is a numerically stable version of log(1/(1-p))
    return size**2 / (2*-log1p(-p))
log2(birthday_state_size(1e12, 1e-6)) # => ~100
So as long as you have more than 100 uniform bits of state in each pack everything should be fine. For example, two or more Python floats is OK (2 * 53), as is 10 integers with >= 1000 distinct values (10*log2(1000)).
You can of course reduce the probability down even further, but as noted in the Wikipedia article going below 1e-15 quickly approaches the reliability of a computer. This is why I say "practically zero" given the 530 bits of state provided by 10 uniformly distributed floats.
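For completeness, the other two rearrangements can be sketched with the same birthday approximation, p ≈ 1 - exp(-n² / (2·states)). The function names below are hypothetical, and states is the number of distinct packs (for example 2**530 for ten floats with 53 bits each):

```python
from math import expm1, log1p, sqrt


def collision_probability(iterations, states):
    # Approximate probability of at least one collision after `iterations` draws.
    # -expm1(-x) is a numerically stable version of 1 - exp(-x).
    return -expm1(-(iterations ** 2) / (2 * states))


def max_iterations(states, p):
    # Approximate number of draws that keeps the collision probability at p.
    return sqrt(2 * states * -log1p(-p))


print(collision_probability(1e12, 250 ** 10))  # ten elements with 250 values each
print(collision_probability(1e12, 2 ** 530))   # ten 53-bit floats
print(max_iterations(2 ** 100, 1e-6))          # draws allowed with 100 bits of state
```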

Parallel algorithm for set splitting

I'm trying to solve a problem that involves splitting a set into subsets.
The input data is a list and an integer.
The task is to divide a set into N-element subsets whose sums are almost equal. As this is an NP-hard problem, I tried two approaches:
a) iterate over all possibilities and distribute the work with mpi4py across many machines (for a list of more than 100 elements and 20-element subsets this runs too long)
b) use mpi4py to send the list to workers with different seeds, but in this case I potentially compute the same set many times. For an instance with 100 numbers and 5 subsets of 20 elements each, the result after 60 s could easily be beaten by a human simply looking at the table.
Ultimately I'm looking for a heuristic algorithm that can run on a distributed system and creates N-element subsets from a bigger set so that their sums are almost equal.
a = list(range(1, 13))
k = 3
One possible solution:
[1,2,11,12] [3,4,9,10] [5,6,7,8]
because the sums are 26, 26, 26.
It is not always possible to create exactly equal sums or equal numbers of elements. The difference between the maximum and minimum number of elements in the sets should be 0 (if len(a)/k is an integer) or 1.
edit 1:
I investigated two options: 1. The parent generates all iterations and then sends them to the parallel workers (but this is too slow for me). 2. The parent sends the list, and each node generates its own subsets and calculates the subset sums within a restricted time, then sends its best result to the parent. The parent receives these results and chooses the one that minimizes the difference between the subset sums. I think the second option has the potential to be faster.
Best regards,
Szczepan

I think you're trying to do something more complicated than necessary - do you actually need an exact solution (global optimum)? Regarding the heuristic solution, I had to do something along these lines in the past so here's my take on it:
Reformulate the problem as follows: You have a vector with given mean ('global mean') and you want to break it into chunks such that means of each individual chunk will be as close as possible to the 'global mean'.
Just divide it into chunks randomly and then iteratively swap elements between the chunks until you get acceptable results. You can experiment with different ways of doing it; here I'm just reshuffling the elements of the chunks with the minimum and maximum 'chunk-mean'.
In general, the bigger the chunk is, the easier it becomes, because the first random split would already give you not-so-bad solution (think sample means).
How big is your input list? I tested this with 100000 elements input (uniform distribution integers). With 50 2000-elements chunks you get the result instantly, with 2000 50-elements chunks you need to wait <1min.
import numpy as np
my_numbers = np.random.randint(10000, size=100000)
chunks = 50
iter_limit = 10000
desired_mean = my_numbers.mean()
acceptable_range = 0.1
split = np.array_split(my_numbers, chunks)
for i in range(iter_limit):
    split_means = np.array([array.mean() for array in split]) # this can be optimized, some of the means are known
    current_min = split_means.min()
    current_max = split_means.max()
    # np.ptp works on all NumPy versions; the ndarray.ptp method was removed in NumPy 2.0
    mean_diff = np.ptp(split_means)
    if(i % 100 == 0 or mean_diff <= acceptable_range):
        print("Iter: {}, Desired: {}, Min {}, Max {}, Range {}".format(i, desired_mean, current_min, current_max, mean_diff))
    if mean_diff <= acceptable_range:
        print('Acceptable solution found')
        break
    min_index = split_means.argmin()
    max_index = split_means.argmax()
    # pop the larger index first so the smaller index stays valid
    if max_index < min_index:
        merged = np.hstack((split.pop(min_index), split.pop(max_index)))
    else:
        merged = np.hstack((split.pop(max_index), split.pop(min_index)))
    reshuffle_range = mean_diff+1
    while reshuffle_range > mean_diff:
        # this while just ensures that you're not getting worse split, either the same or better
        np.random.shuffle(merged)
        modified_arrays = np.array_split(merged, 2)
        reshuffle_range = np.ptp(np.array([array.mean() for array in modified_arrays]))
    split += modified_arrays

What is optimal algorithm to check if a given integer is equal to sum of two elements of an int array?

def check_set(S, k):
    S2 = k - S
    set_from_S2=set(S2.flatten())
    for x in S:
        if(x in set_from_S2):
            return True
    return False
I have a given integer k. I want to check if k is equal to sum of two element of array S.
S = np.array([1,2,3,4])
k = 8
It should return False in this case because no two elements of S sum to 8. The code above treats 8 = 4 + 4 as valid (using the same element twice), so it returns True.
I can't find an algorithm to solve this problem with complexity of O(n).
Can someone help me?

You have to account for multiple instances of the same item, so a set is not a good choice here.
Instead, you can use a dictionary that maps each value to its number of occurrences (or, as a variant, collections.Counter):
A = [3,1,2,3,4]
Cntr = {}
for x in A:
    if x in Cntr:
        Cntr[x] += 1
    else:
        Cntr[x] = 1
#k = 11
k = 8
ans = False
for x in A:
    if (k-x) in Cntr:
        if k == 2 * x:
            if Cntr[k-x] > 1:
                ans = True
                break
        else:
            ans = True
            break
print(ans)
Returns True for k=5,6 (I added one more 3) and False for k=8,11
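The collections.Counter variant mentioned above could look roughly like this. It is a sketch of the same counting idea wrapped in a function; the name has_pair_sum is an assumption made for illustration:

```python
from collections import Counter


def has_pair_sum(values, k):
    # True if two elements at different positions of `values` sum to k.
    counts = Counter(values)
    for x in counts:
        y = k - x
        if y == x:
            # x + x == k only counts if x occurs at least twice
            if counts[x] > 1:
                return True
        elif y in counts:
            return True
    return False


A = [3, 1, 2, 3, 4]
for k in (5, 6, 8, 11):
    print(k, has_pair_sum(A, k))
```

Building the Counter and scanning its keys are both single passes, so the whole check stays O(n) on average.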

Adding onto MBo's answer.
"Optimal" can be an ambiguous term in algorithmics, as there is often a compromise between how fast the algorithm runs and how memory-efficient it is. Sometimes we may also be interested in either worst-case resource consumption or in average resource consumption. We'll look at worst-case here because it's simpler and roughly equivalent to average in our scenario.
Let's call n the length of our array, and let's consider 3 examples.
Example 1
We start with a very naive algorithm for our problem, with two nested loops that iterate over the array, and check for every two items of different indices if they sum to the target number.
Time complexity: the worst-case scenario (where the answer is False, or where it's True but we only find it on the last pair of items we check) has n^2 loop iterations. If you're familiar with big-O notation, we say the algorithm's time complexity is O(n^2), which basically means that, in terms of our input size n, the time it takes to run the algorithm grows more or less like n^2 up to a multiplicative factor (well, technically the notation means "at most like n^2 up to a multiplicative factor", but it's a widespread abuse of language to use it as "more or less like" instead).
Space complexity (memory consumption): we only store an array, plus a fixed set of objects whose sizes do not depend on n (everything Python needs to run, the call stack, maybe two iterators and/or some temporary variables). The part of the memory consumption that grows with n is therefore just the size of the array, which is n times the amount of memory required to store an integer in an array (let's call that sizeof(int)).
Conclusion: Time is O(n^2), Memory is n*sizeof(int) (+O(1), that is, plus an additive constant that does not depend on n, which doesn't matter to us, and which we'll ignore from now on).
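A minimal sketch of the naive algorithm described in Example 1 (the function name is hypothetical). Starting the inner loop at i + 1 checks each unordered pair once, which halves the work but is still O(n^2):

```python
def has_pair_sum_naive(values, k):
    # Check every pair of distinct indices (i, j) with i < j.
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            if values[i] + values[j] == k:
                return True
    return False


print(has_pair_sum_naive([1, 2, 3, 4], 8))  # False: 4 + 4 would reuse one element
print(has_pair_sum_naive([1, 2, 3, 4], 7))  # True: 3 + 4
```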
Example 2
Let's consider the algorithm in MBo's answer.
Time complexity: much, much better than in Example 1. We start by creating a dictionary. This is done in a loop over n. Setting keys in a dictionary is a constant-time operation under proper conditions, so the time taken by each step of that first loop does not depend on n. Therefore, so far we've used O(n) in terms of time complexity. Now we only have one remaining loop over n. The time spent accessing elements of our dictionary is independent of n, so once again, the complexity of this loop is O(n). Combining our two loops: since they both grow like n up to a multiplicative factor, so does their sum (up to a different multiplicative factor). Total: O(n).
Memory: Basically the same as before, plus a dictionary of n elements. For the sake of simplicity, let's consider that these elements are integers (we could have used booleans), and forget about some of the aspects of dictionaries to only count the size used to store the keys and the values. There are n integer keys and n integer values to store, which uses 2*n*sizeof(int) in terms of memory. Add to that what we had before and we have a total of 3*n*sizeof(int).
Conclusion: Time is O(n), Memory is 3*n*sizeof(int). The algorithm is considerably faster when n grows, but uses three times more memory than example 1. In some weird scenarios where almost no memory is available (embedded systems maybe), this 3*n*sizeof(int) might simply be too much, and you might not be able to use this algorithm (admittedly, it's probably never going to be a real issue).
Example 3
Can we find a trade-off between Example 1 and Example 2?
One way to do that is to replicate the same kind of nested loop structure as in Example 1, but with some pre-processing to replace the inner loop with something faster. To do that, we sort the initial array, in place. Done with well-chosen algorithms, this has a time-complexity of O(n*log(n)) and negligible memory usage.
Once we have sorted our array, we write our outer loop (which is a regular loop over the whole array), and then inside that outer loop, use binary search (dichotomy) to look for the number we're missing to reach our target k. A recursive binary search would have a memory consumption of O(log(n)), and its time complexity would be O(log(n)) as well.
Time complexity: The pre-processing sort is O(n*log(n)). Then in the main part of the algorithm, we have n calls to our O(log(n)) binary search, which totals to O(n*log(n)). So, overall, O(n*log(n)).
Memory: Ignoring the constant parts, we have the memory for our array (n*sizeof(int)) plus the memory for our call stack in the binary search (O(log(n))). Total: n*sizeof(int) + O(log(n)).
Conclusion: Time is O(n*log(n)), Memory is n*sizeof(int) + O(log(n)). Memory is almost as small as in Example 1. Time complexity is slightly more than in Example 2. In scenarios where the Example 2 cannot be used because we lack memory, the next best thing in terms of speed would realistically be Example 3, which is almost as fast as Example 2 and probably has enough room to run if the very slow Example 1 does.
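A sketch of Example 3 using the standard bisect module (the function name is an assumption). Two details differ slightly from the description above: sorted() makes a copy so the caller's list is left untouched (use values.sort() to sort in place and save that memory), and bisect_left is iterative, so the search itself needs only constant extra memory rather than an O(log(n)) call stack:

```python
from bisect import bisect_left


def has_pair_sum_sorted(values, k):
    arr = sorted(values)  # O(n*log(n)); copy made here for safety
    n = len(arr)
    for i, x in enumerate(arr):
        target = k - x
        # Search only to the right of i, so an element is never paired with itself.
        j = bisect_left(arr, target, i + 1, n)
        if j < n and arr[j] == target:
            return True
    return False


print(has_pair_sum_sorted([1, 2, 3, 4], 8))     # False
print(has_pair_sum_sorted([3, 1, 2, 3, 4], 6))  # True: 3 + 3 uses two different 3s
```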
Overall conclusion
This answer was just to show that "optimal" is context-dependent in algorithmics. It's very unlikely that in this particular example, one would choose to implement Example 3. In general, you'd see either Example 1 if n is so small that one would choose whatever is simplest to design and fastest to code, or Example 2 if n is a bit larger and we want speed. But if you look at the wikipedia page I linked for sorting algorithms, you'll see that none of them is best at everything. They all have scenarios where they could be replaced with something better.

Memory efficient sort of massive numpy array in Python

I need to sort a VERY large genomic dataset using numpy. I have an array of 2.6 billion floats, dimensions = (868940742, 3) which takes up about 20GB of memory on my machine once loaded and just sitting there. I have an early 2015 13' MacBook Pro with 16GB of RAM, 500GB solid state HD and an 3.1 GHz intel i7 processor. Just loading the array overflows to virtual memory but not to the point where my machine suffers or I have to stop everything else I am doing.
I build this VERY large array step by step from 22 smaller (N, 2) subarrays.
Function FUN_1 generates 2 new (N, 1) arrays using each of the 22 subarrays which I call sub_arr.
The first output of FUN_1 is generated by interpolating values from sub_arr[:,0] on array b = array([X, F(X)]) and the second output is generated by placing sub_arr[:, 0] into bins using array r = array([X, BIN(X)]). I call these outputs b_arr and rate_arr, respectively. The function returns a 3-tuple of (N, 1) arrays:
import numpy as np
def FUN_1(sub_arr):
    """interpolate b values and rates based on position in sub_arr"""
    b = np.load(bfile)
    r = np.load(rfile)
    b_arr = np.interp(sub_arr[:,0], b[:,0], b[:,1])
    rate_arr = np.searchsorted(r[:,0], sub_arr[:,0]) # HUGE efficiency gain over np.digitize...
    return r[rate_arr, 1], b_arr, sub_arr[:,1]
I call the function 22 times in a for-loop and fill a pre-allocated array of zeros full_arr = numpy.zeros([868940742, 3]) with the values:
full_arr[:,0], full_arr[:,1], full_arr[:,2] = FUN_1
In terms of saving memory at this step, I think this is the best I can do, but I'm open to suggestions. Either way, I don't run into problems up through this point and it only takes about 2 minutes.
Here is the sorting routine (there are two consecutive sorts)
for idx in range(2):
    sort_idx = np.argsort(full_arr[:,idx])
    full_arr = full_arr[sort_idx]
# ...
# <additional processing, return small (1000, 3) array of stats>
Now this sort had been working, albeit slowly (takes about 10 minutes). However, I recently started using a larger, more fine resolution table of [X, F(X)] values for the interpolation step above in FUN_1 that returns b_arr and now the SORT really slows down, although everything else remains the same.
Interestingly, I am not even sorting on the interpolated values at the step where the sort is now lagging. Here are some snippets of the different interpolation files - the smaller one is about 30% smaller in each case and far more uniform in terms of values in the second column; the slower one has a higher resolution and many more unique values, so the results of interpolation are likely more unique, but I'm not sure if this should have any kind of effect...?
bigger, slower file:
17399307 99.4
17493652 98.8
17570460 98.2
17575180 97.6
17577127 97
17578255 96.4
17580576 95.8
17583028 95.2
17583699 94.6
17584172 94
smaller, more uniform regular file:
1 24
1001 24
2001 24
3001 24
4001 24
5001 24
6001 24
7001 24
I'm not sure what could be causing this issue and I would be interested in any suggestions or just general input about sorting in this type of memory limiting case!

At the moment each call to np.argsort is generating a (868940742, 1) array of int64 indices, which will take up ~7 GB just by itself. Additionally, when you use these indices to sort the columns of full_arr you are generating another (868940742, 1) array of floats, since fancy indexing always returns a copy rather than a view.
One fairly obvious improvement would be to sort full_arr in place using its .sort() method. Unfortunately, .sort() does not allow you to directly specify a row or column to sort by. However, you can specify a field to sort by for a structured array. You can therefore force an inplace sort over one of the three columns by getting a view onto your array as a structured array with three float fields, then sorting by one of these fields:
full_arr.view('f8, f8, f8').sort(order=['f0'], axis=0)
In this case I'm sorting full_arr in place by the 0th field, which corresponds to the first column. Note that I've assumed that there are three float64 columns ('f8') - you should change this accordingly if your dtype is different. This also requires that your array is contiguous and in row-major format, i.e. full_arr.flags.C_CONTIGUOUS == True.
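Before running this on the full 20 GB array, the trick can be checked on a small stand-in. The sketch below assumes a C-contiguous float64 array with three columns and compares the in-place result against the argsort-based version (which makes a copy); the names small and expected are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
small = rng.random((8, 3))                   # stand-in for full_arr (float64)
expected = small[small[:, 0].argsort()]      # reference result; this makes a copy

# The structured view requires a C-contiguous array of matching item size.
assert small.flags.c_contiguous and small.dtype == np.float64

# Each row is viewed as one record with fields f0, f1, f2; sorting the view
# by f0 reorders whole rows of `small` in place.
small.view('f8, f8, f8').sort(order=['f0'], axis=0)

print(np.array_equal(small, expected))
```

Because the view shares memory with small, no second full-size array is needed for the result.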
Credit for this method should go to Joe Kington for his answer here.
Although it requires less memory, sorting a structured array by field is unfortunately much slower compared with using np.argsort to generate an index array, as you mentioned in the comments below (see this previous question). If you use np.argsort to obtain a set of indices to sort by, you might see a modest performance gain by using np.take rather than direct indexing to get the sorted array:
%%timeit -n 1 -r 100 x = np.random.randn(10000, 2); idx = x[:, 0].argsort()
x[idx]
# 1 loops, best of 100: 148 µs per loop
%%timeit -n 1 -r 100 x = np.random.randn(10000, 2); idx = x[:, 0].argsort()
np.take(x, idx, axis=0)
# 1 loops, best of 100: 42.9 µs per loop
However I wouldn't expect to see any difference in terms of memory usage, since both methods will generate a copy.
Regarding your question about why sorting the second array is faster - yes, you should expect any reasonable sorting algorithm to be faster when there are fewer unique values in the array because on average there's less work for it to do. Suppose I have a random sequence of digits between 1 and 10:
5 1 4 8 10 2 6 9 7 3
There are 10! = 3628800 possible ways to arrange these digits, but only one in which they are in ascending order. Now suppose there are just 5 unique digits:
4 4 3 2 3 1 2 5 1 5
Now there are 2⁵ = 32 ways to arrange these digits in ascending order, since I could swap any pair of identical digits in the sorted vector without breaking the ordering.
By default, np.ndarray.sort() uses Quicksort. The qsort variant of this algorithm works by recursively selecting a 'pivot' element in the array, then reordering the array such that all the elements less than the pivot value are placed before it, and all of the elements greater than the pivot value are placed after it. Values that are equal to the pivot are already sorted. Having fewer unique values means that, on average, more values will be equal to the pivot value on any given sweep, and therefore fewer sweeps are needed to fully sort the array.
For example:
%%timeit -n 1 -r 100 x = np.random.randint(0, 11, 100000)
x.sort()
# 1 loops, best of 100: 2.3 ms per loop
%%timeit -n 1 -r 100 x = np.random.randint(0, 1001, 100000)
x.sort()
# 1 loops, best of 100: 4.62 ms per loop
In this example the dtypes of the two arrays are the same. If your smaller array has a smaller item size compared with the larger array then the cost of copying it due to the fancy indexing will also be smaller.

EDIT: In case anyone new to programming and numpy comes across this post, I want to point out the importance of considering the np.dtype that you are using. In my case, I was actually able to get away with using half-precision floating point, i.e. np.float16, which reduced a 20GB object in memory to 5GB and made sorting much more manageable. The default used by numpy is np.float64, which is a lot of precision that you may not need. Check out the doc here, which describes the capacity of the different data types. Thanks to @ali_m for pointing this out in the comments.
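A quick way to see the trade-off is to compare nbytes and np.finfo for the floating-point dtypes. The row count below is a small stand-in, not the real dataset size.

```python
import numpy as np

n_rows = 1000  # small stand-in for the real row count

for dtype in (np.float64, np.float32, np.float16):
    arr = np.zeros((n_rows, 3), dtype=dtype)
    info = np.finfo(dtype)
    # nbytes scales with the item size; max and precision show what is given up
    print(np.dtype(dtype).name, arr.nbytes, float(info.max), info.precision)

bytes64 = np.zeros((n_rows, 3), dtype=np.float64).nbytes
bytes16 = np.zeros((n_rows, 3), dtype=np.float16).nbytes
```

The float16 array is a quarter of the float64 size, but it only carries about 3 significant decimal digits and cannot represent values above 65504, so it is only suitable when the data really fits that range.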
I did a bad job explaining this question but I have discovered some helpful workarounds that I think would be useful to share for anyone who needs to sort a truly massive numpy array.
I am building a very large numpy array from 22 "sub-arrays" of human genome data containing the elements [position, value]. Ultimately, the final array must be numerically sorted "in place" based on the values in a particular column and without shuffling the values within rows.
The sub-array dimensions follow the form:
arr1.shape = (N1, 2)
...
arr22.shape = (N22, 2)
sum([N1..N22]) = 868940742, i.e. there are close to 1 billion positions to sort.
First I process the 22 sub-arrays with the function process_sub_arrs, which returns a 3-tuple of 1D arrays the same length as the input. I stack the 1D arrays into a new (N, 3) array and insert them into an np.zeros array initialized for the full dataset:
full_arr = np.zeros([868940742, 3])
i, j = 0, 0
for arr in sub_arrays:  # sub_arrays: the list [arr1, ..., arr22]
    # indices (i, j) advance at each iteration by the sub-array length
    j += len(arr)
    full_arr[i:j, :] = np.column_stack(process_sub_arrs(arr))
    i = j
EDIT: Since I realized my dataset could be represented with half-precision floats, I now initialize full_arr as follows: full_arr = np.zeros([868940742, 3], dtype=np.float16), which is only 1/4 the size and much easier to sort.
The result is a massive 20GB array:
full_arr.nbytes = 20854577808
As @ali_m pointed out in his detailed post, my earlier routine was inefficient:
sort_idx = np.argsort(full_arr[:,idx])
full_arr = full_arr[sort_idx]
The array sort_idx, which is 33% the size of full_arr, hangs around and wastes memory after sorting full_arr. This sort supposedly generates a copy of full_arr due to "fancy" indexing, potentially pushing memory use to 233% of what is already used to hold the massive array! This is the slow step, lasting about ten minutes and relying heavily on virtual memory.
I'm not sure the "fancy" sort makes a persistent copy however. Watching the memory usage on my machine, it seems that full_arr = full_arr[sort_idx] deletes the reference to the unsorted original, because after about 1 second all that is left is the memory used by the sorted array and the index, even if there is a transient copy.
A more compact usage of argsort() to save memory is this one:
full_arr = full_arr[full_arr[:,idx].argsort()]
This still causes a spike at the time of the assignment, where both a transient index array and a transient copy are made, but the memory is almost instantly freed again.
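On a toy array the one-liner looks like this. The values are made up, and idx is simply the column to sort by, as in the snippets above.

```python
import numpy as np

full_arr = np.array([[30.0, 0.3, 3.0],
                     [10.0, 0.1, 1.0],
                     [20.0, 0.2, 2.0]])
idx = 0  # column to sort by

# The index array is never bound to a name, so it can be freed
# as soon as the indexing expression has been evaluated.
full_arr = full_arr[full_arr[:, idx].argsort()]
```

Rows are moved as whole units, so the values within each row stay together.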
@ali_m pointed out a nice trick (credited to Joe Kington) for generating a de facto structured array with a view on full_arr. The benefit is that these may be sorted "in place", maintaining stable row order:
full_arr.view('f8, f8, f8').sort(order=['f0'], axis=0)
Views work great for performing mathematical array operations, but for sorting it is far too inefficient for even a single sub-array from my dataset. In general, structured arrays just don't seem to scale very well even though they have really useful properties. If anyone has any idea why this is I would be interested to know.
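For reference, this is how the view trick behaves on a toy array. It is a minimal sketch that assumes a C-contiguous float64 array with exactly three columns, so that each row maps onto one 'f8, f8, f8' record. The data are made up.

```python
import numpy as np

full_arr = np.array([[3.0, 1.0, 2.0],
                     [1.0, 5.0, 6.0],
                     [2.0, 0.0, 9.0]])
address_before = full_arr.ctypes.data   # address of the underlying buffer

# View each row as a single record with fields f0, f1, f2,
# then sort the records in place by the first field.
full_arr.view('f8, f8, f8').sort(order=['f0'], axis=0)

same_buffer = full_arr.ctypes.data == address_before
```

No new array is allocated for the result, which is the memory benefit. The speed penalty described above comes from the sort itself, so it will not show up on an array this small.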
One good option to minimize memory consumption and improve performance with very large arrays is to build a pipeline of small, simple functions. Functions clear local variables once they have completed so if intermediate data structures are building up and sapping memory this can be a good solution.
This is a sketch of the pipeline I've used to speed up the massive array sort:
def process_sub_arrs(arr):
    """process a sub-array and return a 3-tuple of 1D values arrays"""
    # (processing details omitted)
    return values1, values2, values3

def build_arr():
    """build the initial array by joining processed sub-arrays"""
    full_arr = np.zeros([868940742, 3])
    i, j = 0, 0
    for arr in sub_arrays:  # sub_arrays: the list [arr1, ..., arr22]
        # indices (i, j) advance at each iteration by the sub-array length
        j += len(arr)
        full_arr[i:j, :] = np.column_stack(process_sub_arrs(arr))
        i = j
    return full_arr

def sort_arr():
    """build full_arr and return a copy sorted by column `index`"""
    full_arr = build_arr()
    sort_idx = np.argsort(full_arr[:, index])  # index: the sort column, defined elsewhere
    return full_arr[sort_idx]

def get_sorted_arr():
    """call through the nested functions, then summarize the sorted array"""
    sorted_arr = sort_arr()
    # <process sorted_arr>
    return statistics
call stack: get_sorted_arr --> sort_arr --> build_arr --> process_sub_arrs
Once each inner function is completed get_sorted_arr() finally just holds the sorted array and then returns a small array of statistics.
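Here is a scaled-down, runnable version of that call chain for anyone who wants to experiment with it. The body of process_sub_arrs, the sub-array sizes, and the returned statistics are all placeholders invented for the example, since the real processing is specific to my data.

```python
import numpy as np

def process_sub_arrs(arr):
    """Placeholder processing: return three 1D arrays as long as the input."""
    position, value = arr[:, 0], arr[:, 1]
    return position, value, value ** 2

def build_arr(sub_arrays):
    """Preallocate the full array and fill it one sub-array at a time."""
    total_rows = sum(len(arr) for arr in sub_arrays)
    full_arr = np.zeros([total_rows, 3])
    i = 0
    for arr in sub_arrays:
        j = i + len(arr)
        full_arr[i:j, :] = np.column_stack(process_sub_arrs(arr))
        i = j
    return full_arr

def sort_arr(sub_arrays, index):
    """Build the array and return a copy sorted by column `index`."""
    full_arr = build_arr(sub_arrays)
    return full_arr[np.argsort(full_arr[:, index])]

def get_sorted_arr(sub_arrays, index=1):
    """Only the small statistics array outlives this call."""
    sorted_arr = sort_arr(sub_arrays, index)
    column = sorted_arr[:, index]
    return np.array([column[0], column[-1]])   # placeholder statistics: min and max

rng = np.random.default_rng(0)                 # arbitrary seed
sub_arrays = [rng.random((n, 2)) for n in (4, 7, 5)]
statistics = get_sorted_arr(sub_arrays)
```

Because full_arr and sort_idx are local to the inner functions, nothing keeps them alive once get_sorted_arr returns.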
EDIT: It is also worth pointing out here that even if you are able to use a more compact dtype to represent your huge array, you will want to use higher precision for summary calculations. For example, since full_arr.dtype = np.float16, the command np.mean(full_arr[:,idx]) tries to calculate the mean in half-precision floating point, but this quickly overflows when summing over a massive array. Using np.mean(full_arr[:,idx], dtype=np.float64) will prevent the overflow.
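The overflow is easy to reproduce on a small array. As far as I know, whether np.mean itself overflows can depend on the NumPy version, because newer releases document higher-precision intermediates for float16 input, so this sketch uses np.sum, where the result is stored back in float16. The array contents are made up.

```python
import numpy as np

# The true sum is 100000, which is above the largest float16 value (65504).
x = np.full(1000, 100.0, dtype=np.float16)

with np.errstate(over='ignore'):         # silence the overflow warning for the demo
    sum16 = np.sum(x)                    # result is stored as float16
sum64 = np.sum(x, dtype=np.float64)      # accumulate and return in float64
mean64 = np.mean(x, dtype=np.float64)
float16_max = float(np.finfo(np.float16).max)
```

Passing dtype=np.float64 only affects the reduction, so the big array itself can stay in float16.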
I posted this question initially because I was puzzled by the fact that a dataset of identical size suddenly began choking up my system memory, although there was a big difference in the proportion of unique values in the new "slow" set. @ali_m pointed out that, indeed, more uniform data with fewer unique values is easier to sort:
The qsort variant of Quicksort works by recursively selecting a
'pivot' element in the array, then reordering the array such that all
the elements less than the pivot value are placed before it, and all
of the elements greater than the pivot value are placed after it.
Values that are equal to the pivot are already sorted, so intuitively,
the fewer unique values there are in the array, the smaller the number
of swaps there are that need to be made.
On that note, the final change I ended up making to attempt to resolve this issue was to round the newer dataset in advance, since there was an unnecessarily high level of decimal precision leftover from an interpolation step. This ultimately had an even bigger effect than the other memory saving steps, showing that the sort algorithm itself was the limiting factor in this case.
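To give a feel for how much rounding reduces the number of distinct values, here is a small sketch on synthetic data. The uniform random values and the choice of 2 decimal places are assumptions for the example, not my actual dataset or precision.

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed
values = rng.random(100000)           # synthetic stand-in for interpolated values
rounded = np.round(values, 2)         # keep 2 decimal places

n_unique_raw = np.unique(values).size
n_unique_rounded = np.unique(rounded).size
```

After rounding there can be at most 101 distinct values (0.00 through 1.00), compared with close to 100000 before.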
Look forward to other comments or suggestions anyone might have on this topic, and I almost certainly misspoke about some technical issues so I would be glad to hear back :-)
