Given a large Whoosh index, how can I efficiently retrieve n random documents from it?
I can do this naively just by pulling all the documents into memory and using random.sample...
random.sample(list(some_index.searcher().documents()), n)
but that will be horribly inefficient (in terms of memory usage and disk IO) if the index contains a large number of documents.
There might be a better way, but what worked for me in similar situations was assigning a random number to every document while indexing: every document gets a field named rand_id holding a random number. At search time you generate another random number x and search for rand_id > x, limiting the search to n items. If that doesn't yield enough results, search again for rand_id < x and take the rest.
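For concreteness, here is a rough sketch of that approach; the field name rand_id, the float NUMERIC field, and the helper random_documents are my assumptions, not part of the original answer:

import random
from whoosh.fields import Schema, TEXT, NUMERIC
from whoosh.query import NumericRange

# Hypothetical schema: every document gets a random rand_id at indexing time.
schema = Schema(content=TEXT(stored=True),
                rand_id=NUMERIC(float, stored=True))

# At indexing time (writer is an open IndexWriter):
#     writer.add_document(content=text, rand_id=random.random())

def random_documents(ix, n):
    """Return up to n pseudo-random stored documents from the index ix."""
    x = random.random()
    with ix.searcher() as searcher:
        # First take documents with rand_id > x ...
        hits = searcher.search(NumericRange("rand_id", x, None), limit=n)
        docs = [hit.fields() for hit in hits]
        # ... and if that came up short, take the rest from rand_id < x.
        if len(docs) < n:
            rest = searcher.search(NumericRange("rand_id", None, x),
                                   limit=n - len(docs))
            docs.extend(hit.fields() for hit in rest)
    return docs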
Just create a new numeric field ID that should be unique and preferably auto-incrementing. Whoosh has no auto-increment, so you have to do it yourself.
Then, to get your random list, just generate a list of random integers using random.randint(1, MAX_ID), then build a search query "ID:2 OR ID:16 OR ID:43 OR ..." and use it for querying; you will get your desired list (a sketch follows after the range examples below).
You can also query an interval without knowing the max limit or the min limit. For example:
ID:[ 10 to ]
ID:[ to 10]
ID:[ 1 to 10]
ID:2
ID:2 | ID:3
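Putting the above together, a minimal sketch of the ID-based idea, assuming dense integer IDs from 1 to MAX_ID (the helper name random_docs_by_id is mine):

import random
from whoosh.qparser import QueryParser

def random_docs_by_id(ix, n, max_id):
    """Fetch n documents by picking n distinct random IDs (assumes dense IDs 1..max_id)."""
    ids = random.sample(range(1, max_id + 1), n)
    # Build "ID:2 OR ID:16 OR ID:43 OR ..." and let the parser handle the numeric field.
    parser = QueryParser("ID", schema=ix.schema)
    query = parser.parse(" OR ".join("ID:%d" % i for i in ids))
    with ix.searcher() as searcher:
        return [hit.fields() for hit in searcher.search(query, limit=n)]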
Related
Let's say I generate a pack, i.e., a one-dimensional array of 10 random numbers with a random generator. Then I generate another array of 10 random numbers. I do this X times. How can I generate unique arrays, so that even after a trillion generations no array is equal to another?
Within a single array the elements can be duplicates; each array just has to differ from every other array in at least one element.
Is there any numpy method for this? Is there some special algorithm which works differently by exploring some space for the random generation? I don’t know.
One easy answer would be to write the arrays to a file and check whether they were generated already, but the I/O operations on an ever-growing file take far too much time.
This is a difficult request, since one of the properties of a RNG is that it should repeat sequences randomly.
You also have the problem of trying to record terabytes of prior results. One thing you could try is to form a hash table (for search speed) of the existing arrays. Using this depends heavily on whether you have sufficient RAM to hold the entire list.
If not, you might consider disk-mapping a fast search structure of some sort. For instance, you could implement an on-disk binary tree of hash keys, re-balancing whenever you double the size of the tree (with insertions). This lets you keep the file open and find entries via seek, rather than needing to represent the full file in memory.
You could also maintain an in-memory index to the table, using that to drive your seek to the proper file section, then reading only a small subset of the file for the final search.
Does that help focus your implementation?
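For concreteness, here is a rough sketch of the disk-backed lookup idea. It swaps the hand-rolled on-disk tree for the standard-library dbm module (a disk-backed key/value store), which is my own substitution rather than what was described above; the goal is just that the set of already-seen packs never has to fit in RAM:

import dbm
import hashlib

def pack_key(pack):
    # Stable key for a pack; hashlib avoids Python's per-process hash randomization.
    return hashlib.sha1(",".join(map(str, pack)).encode()).hexdigest().encode()

def is_new_pack(db_path, pack):
    """Return True (and record the pack) if this exact pack has not been seen before."""
    with dbm.open(db_path, "c") as db:
        key = pack_key(pack)
        if key in db:
            return False
        db[key] = b"1"
        return True

# Usage sketch: is_new_pack("seen_packs.db", [3, 1, 4, 1, 5, 9, 2, 6, 5, 3])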
Assume that the 10 numbers in a pack are each in the range [0..max]. Each pack can then be considered as a 10 digit number in base max+1. Obviously, the size of max determines how many unique packs there are. For example, if max=9 there are 10,000,000,000 possible unique packs from [0000000000] to [9999999999].
The problem then comes down to generating unique numbers in the correct range.
Given your "trillions" then the best way to generate guaranteed unique numbers in the range is probably to use an encryption with the correct size output. Unless you want 64 bit (DES) or 128 bit (AES) output then you will need some sort of format preserving encryption to get output in the range you want.
For input, just encrypt the numbers 0, 1, 2, ... in turn. Encryption guarantees that, given the same key, the output is unique for each unique input. You just need to keep track of how far you have got with the input numbers. Given that, you can generate more unique packs as needed, within the limit imposed by max. After that point the output will start repeating.
Obviously, as a final step, you need to convert the encryption output to a 10-digit base-(max+1) number and put it into an array.
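As a toy illustration of the "encrypt a counter" idea, here is a sketch that uses a simple affine bijection on the pack space in place of real format-preserving encryption. The mapping is not cryptographically secure and is purely my own stand-in; it only demonstrates that a bijection applied to 0, 1, 2, ... yields unique packs:

from math import gcd

PACK_LEN = 10
MAX_VALUE = 9                              # each element in [0..max]
MODULUS = (MAX_VALUE + 1) ** PACK_LEN      # number of distinct packs

# x -> (A*x + B) mod MODULUS is a bijection whenever gcd(A, MODULUS) == 1.
A, B = 3_141_592_653, 2_718_281_829
assert gcd(A, MODULUS) == 1

def nth_unique_pack(counter):
    """Map the counter 0, 1, 2, ... to a unique pack of PACK_LEN base-(MAX_VALUE+1) digits."""
    x = (A * counter + B) % MODULUS
    digits = []
    for _ in range(PACK_LEN):
        x, d = divmod(x, MAX_VALUE + 1)
        digits.append(d)
    return digits

# Keep a running counter; every call with a new counter value yields a pack
# that no earlier counter value produced (up to MODULUS packs in total).
packs = [nth_unique_pack(i) for i in range(5)]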
Important caveat:
This will not allow you to generate "arbitrarily" many unique packs. Please see the limits highlighted by @Prune.
Note that as the number of requested packs approaches the number of unique packs this takes longer and longer to find a pack. I also put in a safety so that after a certain number of tries it just gives up.
Feel free to adjust:
import random

## -----------------------
## Build a unique pack generator
## -----------------------
def build_pack_generator(pack_length, min_value, max_value, max_attempts):
    existing_packs = set()

    def _generator():
        pack = tuple(random.randint(min_value, max_value) for _ in range(pack_length))
        pack_hash = hash(pack)
        attempts = 1
        while pack_hash in existing_packs:
            if attempts >= max_attempts:
                raise KeyError("Unable to find a valid pack")
            pack = tuple(random.randint(min_value, max_value) for _ in range(pack_length))
            pack_hash = hash(pack)
            attempts += 1
        existing_packs.add(pack_hash)
        return list(pack)

    return _generator

generate_unique_pack = build_pack_generator(2, 1, 9, 1000)
## -----------------------

for _ in range(50):
    print(generate_unique_pack())
The Birthday problem suggests that at some point you don't need to bother checking for duplicates. For example, if each value in a 10 element "pack" can take on more than ~250 values then you only have a 50% chance of seeing a duplicate after generating 1e12 packs. The more distinct values each element can take on the lower this probability.
You've not specified what these random values are in this question (other than being uniformly distributed) but your linked question suggests they are Python floats. Hence each number has 2**53 distinct values it can take on, and the resulting probability of seeing a duplicate is practically zero.
There are a few ways of rearranging this calculation:
for a given amount of state and number of iterations what's the probability of seeing at least one collision
for a given amount of state how many iterations can you generate to stay below a given probability of seeing at least one collision
for a given number of iterations and probability of seeing a collision, what state size is required
The below Python code calculates option 3 as it seems closest to your question. The other options are available on the birthday attack page.
from math import log2, log1p

def birthday_state_size(size, p):
    # -log1p(-p) is a numerically stable version of log(1/(1-p))
    return size**2 / (2*-log1p(-p))

log2(birthday_state_size(1e12, 1e-6)) # => ~100
So as long as you have more than 100 uniform bits of state in each pack everything should be fine. For example, two or more Python floats is OK (2 * 53), as is 10 integers with >= 1000 distinct values (10*log2(1000)).
You can of course reduce the probability down even further, but as noted in the Wikipedia article going below 1e-15 quickly approaches the reliability of a computer. This is why I say "practically zero" given the 530 bits of state provided by 10 uniformly distributed floats.
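For completeness, option 1 (the probability of at least one collision for a given amount of state and number of iterations) is the same approximation rearranged; a small sketch:

from math import expm1

def collision_probability(n_iterations, n_states):
    # P(at least one collision) ~= 1 - exp(-n^2 / (2d)); -expm1 keeps precision when p is tiny.
    return -expm1(-(n_iterations ** 2) / (2 * n_states))

collision_probability(1e12, 250.0 ** 10)   # => ~0.41, the near-coin-flip regime mentioned above
collision_probability(1e12, 2.0 ** 530)    # => ~0.0 for 10 uniformly distributed Python floats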
I have a two-fold problem here.
1) My initial problem, which was trying to improve the computational time of my algorithm.
2) My next problem: one of my 'improvements' appears to consume over 500 GB of RAM, and I don't really know why.
I've been writing a permutation for a bioinformatics pipeline. Basically, the algorithm starts off sampling 1000 null variants for each variant I have. If a certain condition is not met, the algorithm repeats the sampling with 10000 nulls this time. If the condition is not met again, this up-scales to 100000, and then 1000000 nulls per variant.
I've done everything I can think of to optimize this, and at this point I'm stumped for improvements. I have this gnarly looking comprehension:
output_names_df, output_p_values_df = map(list, zip(*[random_sampling_for_variant(variant, options_tables, variant_to_bins, num_permutations, use_num) for variant in variant_list]))
Basically all this is doing is calling my random_sampling_for_variants function on each variant in my variant list, and throwing the two outputs from that function into a list of lists (so, I end up with two lists of lists, output_names_df, and output_p_values_df). I then turn these list of lists back into DataFrames, rename the columns, and do whatever I need with them. The function being called looks like:
def random_sampling_for_variant(variant, options_tables, variant_to_bins, num_permutations, use_num):
    """
    Inner function to permute_null_variants_single
    equivalent to permutation for 1 variant
    """
    #Get nulls that are in the same bin as our variant
    permuted_null_variant_table = options_tables.get_group(variant_to_bins[variant])
    #If number of permutations > number of nulls, sample with replacement
    if num_permutations >= len(permuted_null_variant_table):
        replace = True
    else:
        replace = False
    #Select rows for permutation, then add as columns to our temporary dfs
    picked_indices = permuted_null_variant_table.sample(n=num_permutations, replace=replace)
    temp_names_df = picked_indices['variant'].values[:use_num]
    temp_p_values_df = picked_indices['GWAS_p_value'].values
    return(temp_names_df, temp_p_values_df)
When defining permuted_null_variant_table, I'm just querying a pre-defined grouped up DataFrame to determine the appropriate null table to sample from. I found this to be faster than trying to determine the appropriate nulls to sample from on the fly. The logic that follows in there determines whether I need to sample with or without replacement, and it takes hardly any time at all. Defining picked_indices is where the random sampling is actually being done. temp_names_df and temp_p_values_df get the values that I want out of the null table, and then I send my returns out of the function.
The problem here is that the above doesn't scale very well. For 1000 permutations, the above lines of code take ~254 seconds on 7725 variants. For 10,000 permutations, ~333 seconds. For 100,000 permutations, ~720 seconds. For 1,000,000 permutations, which is the upper boundary I'd like to get to, my process got killed because it was apparently going to take up more RAM than the cluster had.
I'm stumped on how to proceed. I've turned all my ints into 8-bit uints and all my floats into 16-bit floats. I've dropped columns that I don't need where I don't need them, so I'm only ever sampling from tables with the columns I need. Initially I had the comprehension in the form of a loop, but the computational time of the comprehension is 3/5ths that of the loop. Any feedback is appreciated.
So I know that a set is supposed to be an unordered collection. I am trying to do some coding of my own and ended up with something weird: my set will sometimes print in order from 1-100 (when I ask for a larger quantity of numbers), but when I use a smaller quantity it stays unordered. Why is that?
#Steps:
#1) Take a number value for total random numbers in 1-100
#2) Put those numbers into a set (which will remove duplicates)
#3) Print that set and the total number of random numbers
import random
randomnums = 0
Min = int(1)
Max = int(100)
print('How many random numbers would you like?')
numsneeded = int(input('Please enter a number. '))
print("\n" * 25)
s = set()
while (randomnums < numsneeded):
    number = random.randint(Min, Max)
    s.add(number)
    randomnums = randomnums + 1
print(s)
print(len(s))
If anyone has any pointers on cleaning up my code I am 100% willing to learn. Thank you for your time!
When the documentation for set says it is an unordered collection, it only means that you can assume no specific order on the elements of the set. The set can choose what internal representation it uses to hold the data, and when you ask for the elements, they might come back in any order at all. The fact that they are sorted in some cases might mean that the set has chosen to store your elements in a sorted manner.
The set can make tradeoff decisions between performance and space depending on factors such as the number of elements in the set. For example, it could store small sets in a list, but larger sets in a tree. The most natural way to retrieve elements from a tree is in sorted order, so that's what could be happening for you.
See also Can Python's set absence of ordering be considered random order? for further info about this.
Sets are implemented with a hash implementation. The hash of an integer is just the integer. To determine where to put the number in the table the remainder of the integer when divided by the table size is used. The table starts with a size of 8, so the numbers 0 to 7 would be placed in their own slot in order, but 8 would be placed in the 0 slot. If you add the numbers 1 to 4 and 8 into an empty set it will display as:
set([8,1,2,3,4])
What happens when 5 is added is that the table has exceeded 2/3rds full. At that point the table is increased in size to 32. When creating the new table the existing table is repopulated into the new table. Now it displays as:
set([1,2,3,4,5,8])
In your example, as long as you've added enough entries to cause the table to grow to 128 slots, they will all be placed in the table in their own slots, in order. If you've only added enough entries that the table has 32 slots, but you are using numbers up to 100, the items won't necessarily be in order.
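For illustration, a quick experiment along the lines of the walkthrough above; the exact resize points and display order are CPython implementation details and vary between versions:

# Iteration order reflects the internal hash table, so it can change as the table grows.
s = set()
for value in (8, 1, 2, 3, 4):
    s.add(value)
print(s)   # in a small table, 8 collides into an early slot and can appear "out of order"
s.add(5)   # enough entries to trigger (or approach) a table resize, depending on the version
print(s)   # after a resize, these small integers land in their own slots, i.e. they print sorted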
The question is pretty much in the title, but say I have a list L
L = [1,2,3,4,5]
min(L) = 1 here. Now I remove 4. The min is still 1. Then I remove 2. The min is still 1. Then I remove 1. The min is now 3. Then I remove 3. The min is now 5, and so on.
I am wondering if there is a good way to keep track of the min of the list at all times without needing to do min(L) or scanning through the entire list, etc.
There is an efficiency cost to actually removing the items from the list because it has to move everything else over. Re-sorting the list each time is expensive, too. Is there a way around this?
To remove a random element you need to know what elements have not been removed yet.
To know the minimum element, you need to sort or scan the items.
A min heap implemented as an array neatly solves both problems. The cost to remove an item is O(log N) and the cost to find the min is O(1). The items are stored contiguously in an array, so choosing one at random is very easy, O(1).
The min heap is described on this Wikipedia page
BTW, if the data are large, you can leave them in place and store pointers or indexes in the min heap and adjust the comparison operator accordingly.
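Here is a sketch of that idea as a small self-contained class (the class and method names are mine); it keeps the items in an array-backed min-heap so find-min is O(1) and removing a randomly chosen element is O(log n):

import random

class MinHeap:
    """Array-backed min-heap: O(1) find-min, O(log n) removal of an arbitrary position."""
    def __init__(self, items):
        self.a = list(items)
        for i in reversed(range(len(self.a) // 2)):   # heapify bottom-up
            self._sift_down(i)

    def min(self):
        return self.a[0]

    def remove_random(self):
        """Remove and return a uniformly random element."""
        return self.remove_at(random.randrange(len(self.a)))

    def remove_at(self, i):
        a = self.a
        a[i], a[-1] = a[-1], a[i]          # swap the target with the last element
        removed = a.pop()
        if i < len(a):                      # restore the heap property around position i
            self._sift_down(i)
            self._sift_up(i)
        return removed

    def _sift_up(self, i):
        a = self.a
        while i > 0 and a[i] < a[(i - 1) // 2]:
            parent = (i - 1) // 2
            a[i], a[parent] = a[parent], a[i]
            i = parent

    def _sift_down(self, i):
        a, n = self.a, len(self.a)
        while True:
            left, right, smallest = 2 * i + 1, 2 * i + 2, i
            if left < n and a[left] < a[smallest]:
                smallest = left
            if right < n and a[right] < a[smallest]:
                smallest = right
            if smallest == i:
                return
            a[i], a[smallest] = a[smallest], a[i]
            i = smallest

h = MinHeap([1, 2, 3, 4, 5])
print(h.min())        # 1, without scanning
h.remove_random()     # O(log n)
print(h.min())        # still O(1)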
Google for self-balancing binary search trees. Building one from the initial list takes O(n lg n) time, and finding and removing an arbitrary item takes O(lg n) (instead of O(n) for finding/removing from a plain list). The smallest item always appears at the root of the tree.
This question may be useful. It provides links to several implementation of various balanced binary search trees. The advice to use a hash table does not apply well to your case, since it does not address maintaining a minimum item.
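In Python specifically, a practical stand-in for a balanced search tree is the third-party sortedcontainers package (my suggestion, not part of the answer above; pip install sortedcontainers). A quick sketch:

import random
from sortedcontainers import SortedList

sl = SortedList([1, 2, 3, 4, 5])
print(sl[0])                  # current minimum, no scan needed
victim = random.choice(sl)    # SortedList supports indexing, so random.choice works
sl.remove(victim)             # removal keeps the order, roughly O(log n)
print(sl[0])                  # minimum is still immediately available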
Here's a solution that needs O(N lg N) preprocessing time, O(lg N) time per delete, and O(lg^2 N) time per find-min query.
Preprocessing:
step 1: sort L
step 2: for each item L[i], map L[i] -> i
step 3: build a Binary Indexed Tree (or segment tree) where, for every 1 <= i <= length of L, BIT[i] = 1, and maintain the range sums.
Query type delete:
Step 1: if an item x is to be removed, find its index with a binary search on the sorted array L (or from the mapping), set BIT[index[x]] = 0, and update the affected ranges. Runtime: O(lg N)
Query type findMin:
Step 1: do a binary search over array L. For each mid, compute the prefix sum on the BIT from 1 to mid. If that sum > 0, we know some value <= L[mid] is still alive, so record mid as a candidate and set hi = mid - 1; otherwise set lo = mid + 1. Runtime: O(lg^2 N)
Same can be done with Segment tree.
Edit: if I'm not wrong, each query can be processed in O(1) with a linked list.
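A sketch of the BIT approach in Python, assuming the list values are distinct (the class and method names are mine):

class MinTracker:
    """Fenwick tree over the sorted values: delete in O(lg n), find-min in O(lg^2 n)."""
    def __init__(self, values):
        self.sorted_vals = sorted(values)                                 # step 1
        self.index = {v: i + 1 for i, v in enumerate(self.sorted_vals)}   # step 2 (1-based)
        self.n = len(self.sorted_vals)
        self.tree = [0] * (self.n + 1)
        for i in range(1, self.n + 1):                                    # step 3: BIT[i] = 1
            self._add(i, 1)

    def _add(self, i, delta):
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def _prefix_sum(self, i):
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def delete(self, value):
        self._add(self.index[value], -1)      # set this position's count to 0

    def find_min(self):
        lo, hi, answer = 1, self.n, None
        while lo <= hi:
            mid = (lo + hi) // 2
            if self._prefix_sum(mid) > 0:     # something <= sorted_vals[mid-1] is still alive
                answer = self.sorted_vals[mid - 1]
                hi = mid - 1
            else:
                lo = mid + 1
        return answer

t = MinTracker([1, 2, 3, 4, 5])
t.delete(4); t.delete(2); print(t.find_min())   # 1
t.delete(1); print(t.find_min())                # 3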
If sorting isn't in your best interest, I would suggest only doing comparisons where you need to do them. If you remove elements that are not the old minimum and you aren't inserting any new elements, no re-scan for the minimum value is necessary.
Can you give us some more information about the processing going on that you are trying to do?
Comment answer: you don't have to compute min(L). Just keep track of its index and only re-run the scan for min(L) when you remove at (or below) the old index (and make sure you track the index accordingly).
Your current approach of rescanning when the minimum is removed is O(1)-time in expectation for each removal (assuming every item is equally likely to be removed).
Given a list of n items, a rescan is necessary with probability 1/n, so the expected work at each step is n * 1/n = O(1).
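A small sketch of that bookkeeping, assuming removals are by position and uniformly random (the helper name is mine):

import random

def remove_random_tracking_min(L):
    """Remove random elements one by one, re-scanning for the min only when the min is removed."""
    current_min = min(L)
    while L:
        removed = L.pop(random.randrange(len(L)))
        if removed == current_min and L:
            current_min = min(L)       # O(n) rescan, but only needed with probability ~1/n
        print("removed", removed, "-> min is now", current_min if L else None)

remove_random_tracking_min([1, 2, 3, 4, 5])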
How do I get two distinct random records using Django? I've seen questions about how to get one but I need to get two random records and they must differ.
The order_by('?')[:2] solution suggested by other answers is actually an extraordinarily bad thing to do for tables that have large numbers of rows. It results in an ORDER BY RAND() SQL query. As an example, here's how mysql handles that (the situation is not much different for other databases). Imagine your table has one billion rows:
To accomplish ORDER BY RAND(), it needs a RAND() column to sort on.
To do that, it needs a new table (the existing table has no such column).
To do that, mysql creates a new, temporary table with the new columns and copies the existing ONE BILLION ROWS OF DATA into it.
As it does so, it does as you asked, and runs rand() for every row to fill in that value. Yes, you've instructed mysql to GENERATE ONE BILLION RANDOM NUMBERS. That takes a while. :)
A few hours/days later, when it's done it now has to sort it. Yes, you've instructed mysql to SORT THIS ONE BILLION ROW, WORST-CASE-ORDERED TABLE (worst-case because the sort key is random).
A few days/weeks later, when that's done, it faithfully grabs the two measly rows you actually needed and returns them for you. Nice job. ;)
Note: just for a little extra gravy, be aware that mysql will initially try to create that temp table in RAM. When that's exhausted, it puts everything on hold to copy the whole thing to disk, so you get that extra knife-twist of an I/O bottleneck for nearly the entire process.
Doubters should look at the generated query to confirm that it's ORDER BY RAND(), then Google for "order by rand()" (with the quotes).
A much better solution is to trade that one really expensive query for three cheap ones (limit/offset instead of ORDER BY RAND()):
import random

last = MyModel.objects.count() - 1

index1 = random.randint(0, last)

# Here's one simple way to keep an even distribution for
# index2 while still guaranteeing it won't match index1.
index2 = random.randint(0, last - 1)
if index2 == index1: index2 = last

# This syntax will generate "OFFSET=indexN LIMIT=1" queries,
# so each returns a single record with no extraneous data.
MyObj1 = MyModel.objects.all()[index1]
MyObj2 = MyModel.objects.all()[index2]
If you specify the random operator in the ORM, I'm pretty sure it will give you two distinct random results, won't it?
MyModel.objects.order_by('?')[:2] # 2 random results.
For future readers.
Get the list of ids of all records:
my_ids = MyModel.objects.values_list('id', flat=True)
my_ids = list(my_ids)
Then pick n random ids from all of the above ids:
import random

n = 2
rand_ids = random.sample(my_ids, n)
And get records for these ids:
random_records = MyModel.objects.filter(id__in=rand_ids)
Object.objects.order_by('?')[:2]
This would return two randomly ordered records. You can add
distinct()
if there are records with the same value in your dataset.
To sample n random values from a sequence, the random module can be used:
random.Random().sample(range(0, last), 2)
will fetch 2 random samples from among the sequence elements 0 to last-1.
from django.db import models
from random import randint
from django.db.models.aggregates import Count

class ProductManager(models.Manager):
    def random(self, count=5):
        index = randint(0, self.aggregate(count=Count('id'))['count'] - count)
        return self.all()[index:index + count]
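A hypothetical usage, attaching the manager to an assumed Product model. Note the design trade-off: this returns count consecutive rows starting at a random offset, not count independent random picks.

class Product(models.Model):
    name = models.CharField(max_length=100)
    objects = ProductManager()

# Product.objects.random()    -> 5 adjacent records starting at a random offset
# Product.objects.random(2)   -> 2 adjacent records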
You can get a different number of objects.
class ModelName(models.Model):
    # Define model fields etc.

    @classmethod
    def get_random(cls, n=2):
        """Returns a number of random objects. Pass the number when calling."""
        import random
        n = int(n)  # Number of objects to return
        last = cls.objects.count() - 1
        selection = random.sample(range(0, last + 1), n)  # include the last index
        selected_objects = []
        for each in selection:
            selected_objects.append(cls.objects.all()[each])
        return selected_objects