How to get two random records with Django - python

How do I get two distinct random records using Django? I've seen questions about how to get one but I need to get two random records and they must differ.

The order_by('?')[:2] solution suggested by other answers is actually an extraordinarily bad thing to do for tables that have large numbers of rows. It results in an ORDER BY RAND() SQL query. As an example, here's how MySQL handles that (the situation is not much different for other databases). Imagine your table has one billion rows:
To accomplish ORDER BY RAND(), it needs a RAND() column to sort on.
To do that, it needs a new table (the existing table has no such column).
To do that, MySQL creates a new, temporary table with the new column and copies the existing ONE BILLION ROWS OF DATA into it.
As it does so, it does as you asked, and runs rand() for every row to fill in that value. Yes, you've instructed MySQL to GENERATE ONE BILLION RANDOM NUMBERS. That takes a while. :)
A few hours/days later, when it's done, it now has to sort it. Yes, you've instructed MySQL to SORT THIS ONE-BILLION-ROW, WORST-CASE-ORDERED TABLE (worst-case because the sort key is random).
A few days/weeks later, when that's done, it faithfully grabs the two measly rows you actually needed and returns them for you. Nice job. ;)
Note: just for a little extra gravy, be aware that MySQL will initially try to create that temp table in RAM. When that's exhausted, it puts everything on hold to copy the whole thing to disk, so you get that extra knife-twist of an I/O bottleneck for nearly the entire process.
Doubters should look at the generated query to confirm that it's ORDER BY RAND(), then Google for "order by rand()" (with the quotes).
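For anyone who wants to check, Django will show you the SQL it plans to run (a minimal sketch; MyModel stands in for your model):
print(MyModel.objects.order_by('?')[:2].query)
# On MySQL this ends with something like: ... ORDER BY RAND() LIMIT 2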
A much better solution is to trade that one really expensive query for three cheap ones (limit/offset instead of ORDER BY RAND()):
import random
last = MyModel.objects.count() - 1
index1 = random.randint(0, last)
# Here's one simple way to keep an even distribution for
# index2 while still guaranteeing it doesn't match index1.
index2 = random.randint(0, last - 1)
if index2 == index1: index2 = last
# This indexing syntax generates a "LIMIT 1 OFFSET indexN" query,
# so each lookup returns a single record with no extraneous data.
MyObj1 = MyModel.objects.all()[index1]
MyObj2 = MyModel.objects.all()[index2]

If you specify the random ordering operator in the ORM, I'm pretty sure it will give you two distinct random results, won't it?
MyModel.objects.order_by('?')[:2] # 2 random results.

For future readers.
Get the list of ids of all records:
my_ids = MyModel.objects.values_list('id', flat=True)
my_ids = list(my_ids)
Then pick n random ids from all of the above ids:
import random
n = 2
rand_ids = random.sample(my_ids, n)
And get records for these ids:
random_records = MyModel.objects.filter(id__in=rand_ids)
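Note that random.sample() raises a ValueError if n exceeds the number of available ids; if that can happen, a small guard (not part of the original answer) avoids it:
rand_ids = random.sample(my_ids, min(n, len(my_ids)))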

Object.objects.order_by('?')[:2]
This would return two records in random order. You can add .distinct() if there are records with the same value in your dataset.

For sampling n random values from a sequence, the random lib can be used:
random.Random().sample(range(0, last), 2)
will fetch 2 random samples from among the sequence elements 0 to last - 1.
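For example (the output will vary from run to run):
>>> import random
>>> random.Random().sample(range(0, 10), 2)
[7, 3]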

from django.db import models
from random import randint
from django.db.models.aggregates import Count

class ProductManager(models.Manager):
    def random(self, count=5):
        index = randint(0, self.aggregate(count=Count('id'))['count'] - count)
        return self.all()[index:index + count]
You can get a different number of objects by passing count.
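A hypothetical usage sketch (the Product model and its field are assumptions, not from the answer). Note that this returns count consecutive objects starting at a random offset, rather than count independently random ones:

class Product(models.Model):
    name = models.CharField(max_length=100)
    objects = ProductManager()

random_products = Product.objects.random(count=3)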

class ModelName(models.Model):
    # Define model fields etc

    @classmethod
    def get_random(cls, n=2):
        """Returns a number of random objects. Pass the number when calling."""
        import random
        n = int(n)  # Number of objects to return
        last = cls.objects.count() - 1
        # Sample n distinct indices from 0 to last, inclusive.
        selection = random.sample(range(0, last + 1), n)
        selected_objects = []
        for each in selection:
            selected_objects.append(cls.objects.all()[each])
        return selected_objects
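Hypothetical usage, with the class as defined above:
three_random = ModelName.get_random(3)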

Related

Gensim docvecs.most_similar returns IDs that don't exist

I'm trying to create an algorithm that's capable of showing the top n documents similar to a specific document.
For that I used gensim doc2vec. The code is below:
model = gensim.models.doc2vec.Doc2Vec(size=400, window=8, min_count=5, workers=11,
                                      dm=0, alpha=0.025, min_alpha=0.025, dbow_words=1)
model.build_vocab(train_corpus)
for x in xrange(10):
    model.train(train_corpus)
    model.alpha -= 0.002
    model.min_alpha = model.alpha
    model.train(train_corpus)
model.save('model_EN_BigTrain')
sims = model.docvecs.most_similar([408], topn=10)
The sims var should give me 10 tuples, where the first element is the id of the doc and the second is the score.
The problem is that some ids do not correspond to any document in my training data.
I've been trying for some time now to make sense of the ids that aren't in my training data but I don't see any logic.
PS: This is the code that I used to create my train_corpus
def readData(train_corpus, jData):
    print("The response contains {0} properties".format(len(jData)))
    print("\n")
    for i in xrange(len(jData)):
        print "> Reading offers from Aux array"
        if i % 10 == 0:
            print ">>", i, "offers processed..."
        train_corpus.append(gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(jData[i][1]), tags=[jData[i][0]]))
    print "> Finished processing offers"
Each position of the aux array is itself an array in which position 0 is an int (which I want to be the id) and position 1 is a description.
Thanks in advance.
Are you using plain integer IDs as your tags, but not using exactly all of the integers from 0 to whatever your MAX_DOC_ID is?
If so, that could explain the appearance of tags within that range. When you use plain ints, gensim Doc2Vec avoids creating a dict mapping provided tags to index-positions in its internal vector-array – and just uses the ints themselves.
Thus that internal vector-array must be allocated to include MAX_DOC_ID + 1 rows. Any rows corresponding to unused IDs are still initialized as random vectors, like all the positions, but won't receive any of the training from actual text examples to push them into meaningful relative positions. It's thus possible these random-initialized-but-untrained vectors could appear in later most_similar() results.
To avoid that, either use only contiguous ints from 0 to the last ID you need. Or, if you can afford the memory cost of the string-to-index mapping, use string tags instead of plain ints. Or, keep an extra record of the valid IDs and manually filter the unwanted IDs from results.
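A minimal sketch of that last option, filtering results down to known-valid IDs (variable names are illustrative, not from the question):
valid_ids = set(tag for doc in train_corpus for tag in doc.tags)
sims = [(doc_id, score)
        for doc_id, score in model.docvecs.most_similar([408], topn=50)
        if doc_id in valid_ids][:10]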
Separately: by not specifying iter=1 in your Doc2Vec model initialization, the default of iter=5 will be in effect, meaning each call to train() does 5 iterations over your data. Oddly, also, your xrange(10) for-loop includes two separate calls to train() each iteration (and the 1st is just using whatever alpha/min_alpha was already in place). So you're actually doing 10 * 2 * 5 = 100 passes over the data, with an odd learning-rate schedule.
I suggest instead if you want 10 passes to just set iter=10, leave default alpha/min_alpha untouched, and then call train() only once. The model will do 10 passes, smoothly managing alpha from its starting to ending values.
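As a rough sketch of that suggestion, reusing the parameter values from the question (and the older gensim API the question already uses, where train() takes only the corpus):
model = gensim.models.doc2vec.Doc2Vec(size=400, window=8, min_count=5, workers=11,
                                      dm=0, dbow_words=1, iter=10)
model.build_vocab(train_corpus)
model.train(train_corpus)   # a single call; the model does 10 passes and decays alpha itself
model.save('model_EN_BigTrain')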
I was having this problem as well; I was initializing my doc2vec with the following:
for idx, doc in data.iterrows():
    alldocs.append(TruthDocument(doc['clean_text'], [idx], doc['label']))
I was passing it a dataframe that had some wonky indexes. All I had to do was:
df.reset_index(inplace=True)

Searching values in a large matrix

I'm working with Python 3.5 and I'm writing a script that handles large spreadsheet files. Each row of the spreadsheet contains a phrase and several other relevant values. I'm parsing the file as a matrix, but the example file has over 3000 rows (and even larger files should be expected). I also have a list of 100 words. I need to find, for each word, which rows of the matrix contain it in their string, and print some averages based on that.
Currently I'm iterating over each row of the matrix and then checking whether the string contains any of the mentioned words, but this process takes 3000 iterations, with 100 checks for each one. Is there a better way to accomplish this task?
In the long run, I would encourage you to use something more suitable for the task. A SQL database, for instance.
But if you stick with writing your own python solution, here are some things you can do to optimize it:
Use sets. Sets have a very efficient membership check.
wordset_100 = set(worldlist_100)
for row in data_3k:
    word_matches = wordset_100.intersection(row.phrase.split(" "))
    for match in word_matches:
        # add to accumulator
        # this loop runs at most len(row.phrase.split(' ')) times
        pass
Parallelize.
from multiprocessing import Pool
from functools import partial
from collections import defaultdict

def matches(wordset_100, row):
    return wordset_100.intersection(row.phrase.split(" ")), row

if __name__ == "__main__":
    accu = defaultdict(int)
    p = Pool()
    wordset_100 = set(worldlist_100)
    # partial() binds the word set so Pool.map only has to pass each row.
    for m, r in p.map(partial(matches, wordset_100), data_3k):
        for word in m:
            accu[word] += r.number

Efficiently get n random documents from a Whoosh index

Given a large Whoosh index, how can I efficiently retrieve n random documents from it?
I can do this just by pulling all the documents into memory and using random.sample...
random.sample(list(some_index.searcher().documents()), n)
but that will be horribly inefficient (in terms of memory usage and disk IO) if the index contains a large number of documents.
There might be a better way, but what worked for me in similar situations was assigning a random number to every document while indexing. Every document gets a field named rand_id with a random number. You can then generate another random number x at the time of searching and search for rand_id > x. You can then limit the search to n items. If the search didn't yield enough results, search again for rand_id < x and take the rest.
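A minimal sketch of that idea (the schema and field names are assumptions for illustration, not from the answer):

import random
from whoosh.fields import Schema, TEXT, NUMERIC
from whoosh.query import NumericRange

schema = Schema(content=TEXT(stored=True), rand_id=NUMERIC(float, stored=True))

# At indexing time, give every document a random value:
#   writer.add_document(content=text, rand_id=random.random())

# At search time, pick a random threshold and take up to n documents above it:
def random_docs(searcher, n):
    x = random.random()
    hits = list(searcher.search(NumericRange("rand_id", x, None), limit=n))
    if len(hits) < n:
        # Not enough above the threshold; take the rest from below it.
        hits += list(searcher.search(NumericRange("rand_id", None, x), limit=n - len(hits)))
    return hits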
Just create a new numeric field ID that should be unique and preferably auto-incrementing. Whoosh has no auto-increment, so you have to do it yourself.
Then, to get your random list, just generate a list of random integers using random.randint(1, MAX_ID), then build a search query "ID:2 OR ID:16 OR ID:43 OR ..." and use it for querying; you will get your desired list (see the sketch after the examples below).
You can query an interval without knowing the max limit or the min limit. For example:
ID:[ 10 to ]
ID:[ to 10]
ID:[ 1 to 10]
ID:2
ID:2 | ID:3
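A minimal sketch of the OR-query idea (the index ix, the ID and content field names, and MAX_ID are assumptions for illustration):

import random
from whoosh.qparser import QueryParser

n = 2
ids = random.sample(range(1, MAX_ID + 1), n)
query_str = " OR ".join("ID:%d" % i for i in ids)
with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse(query_str)
    results = searcher.search(query, limit=n)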

Sometimes my set comes out ordered and sometimes not (Python)

So I know that a set is supposed to be an unordered list. I am trying to do some coding of my own and ended up with something weird: my set will sometimes come out in order from 1 to 100 (when using a larger number), and when I use a smaller number it will stay unordered. Why is that?
#Steps:
#1) Take a number value for total random numbers in 1-100
#2) Put those numbers into a set (which will remove duplicates)
#3) Print that set and the total number of random numbers
import random
randomnums = 0
Min = int(1)
Max = int(100)
print('How many random numbers would you like?')
numsneeded = int(input('Please enter a number. '))
print("\n" * 25)
s = set()
while (randomnums < numsneeded):
    number = random.randint(Min, Max)
    s.add(number)
    randomnums = randomnums + 1
print s
print len(s)
If anyone has any pointers on cleaning up my code I am 100% willing to learn. Thank you for your time!
When the documentation for set says it is an unordered collection, it only means that you can assume no specific order on the elements of the set. The set can choose what internal representation it uses to hold the data, and when you ask for the elements, they might come back in any order at all. The fact that they are sorted in some cases might mean that the set has chosen to store your elements in a sorted manner.
The set can make tradeoff decisions between performance and space depending on factors such as the number of elements in the set. For example, it could store small sets in a list, but larger sets in a tree. The most natural way to retrieve elements from a tree is in sorted order, so that's what could be happening for you.
See also Can Python's set absence of ordering be considered random order? for further info about this.
Sets are implemented with a hash implementation. The hash of an integer is just the integer. To determine where to put the number in the table the remainder of the integer when divided by the table size is used. The table starts with a size of 8, so the numbers 0 to 7 would be placed in their own slot in order, but 8 would be placed in the 0 slot. If you add the numbers 1 to 4 and 8 into an empty set it will display as:
set([8,1,2,3,4])
What happens when 5 is added is that the table has exceeded 2/3rds full. At that point the table is increased in size to 32. When creating the new table the existing table is repopulated into the new table. Now it displays as:
set([1,2,3,4,5,8])
In your example, as long as you've added enough entries to cause the table to grow to 128 slots, they will all be placed in the table in their own bins in order. If you've only added enough entries that the table has 32 slots, but you are using numbers up to 100, the items won't necessarily be in order.
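A quick way to see this in an interpreter (CPython 2.x shown; the exact display is an implementation detail and may differ between versions):
>>> s = set()
>>> for n in [1, 2, 3, 4, 8]:
...     s.add(n)
...
>>> s
set([8, 1, 2, 3, 4])
>>> s.add(5)   # pushes the table past 2/3 full, triggering a resize
>>> s
set([1, 2, 3, 4, 5, 8])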

Python memory error for a large data set

I want to generate a 'bag of words' matrix containing documents with the corresponding counts for the words in the document. In order to do this I run the code below to initialise the bag of words matrix. Unfortunately I receive a memory error after a certain number of documents, in the line where I read the document. Is there a better way of doing this, so that I can avoid the memory error? Please be aware that I would like to process a very large number of documents, ~2,000,000, with only 8 GB of RAM.
def __init__(self, paths, words_count, normalize_matrix=False, trainingset_size=None, validation_set_words_list=None):
    '''
    Open all documents from the given path.
    Initialize the variables needed in order
    to construct the word matrix.

    Parameters
    ----------
    paths: paths to the documents.
    words_count: number of words in the bag of words.
    trainingset_size: the proportion of the data that should be set to the training set.
    validation_set_words_list: the attributes for validation.
    '''
    print '################ Data Processing Started ################'
    self.max_words_matrix = words_count
    print '________________ Reading Docs From File System ________________'
    timer = time()
    for folder in paths:
        self.class_names.append(folder.split('/')[len(folder.split('/'))-1])
        print '____ dataprocessing for category '+folder
        if trainingset_size == None:
            docs = os.listdir(folder)
        elif not trainingset_size == None and validation_set_words_list == None:
            docs = os.listdir(folder)[:int(len(os.listdir(folder))*trainingset_size-1)]
        else:
            docs = os.listdir(folder)[int(len(os.listdir(folder))*trainingset_size+1):]
        count = 1
        length = len(docs)
        for doc in docs:
            if doc.endswith('.txt'):
                d = open(folder+'/'+doc).read()
                # Append a filtered version of the document to the document list.
                self.docs_list.append(self.__filter__(d))
                # Append the name of the document to the list containing document names.
                self.docs_names.append(doc)
                # Increase the class indices counter.
                self.class_indices.append(len(self.class_names)-1)
            print 'Processed '+str(count)+' of '+str(length)+' in category '+folder
            count += 1
What you're asking for isn't possible. Also, Python doesn't automatically get the space benefits you're expecting from BoW. Plus, I think you're doing the key piece wrong in the first place. Let's take those in reverse order.
Whatever you're doing in this line:
self.docs_list.append(self.__filter__(d))
… is likely wrong.
All you want to store for each document is a count vector. In order to get that count vector, you will need to append to a single dict of all words seen. Unless __filter__ is modifying a hidden dict in-place, and returning a vector, it's not doing the right thing.
The main space savings in the BoW model come from not having to store copies of the string keys for each document, and from being able to store a simple array of ints instead of a fancy hash table. But an integer object is nearly as big as a (short) string object, and there's no way to predict or guarantee when you get new integers or strings vs. additional references to existing ones. So, really, the only advantage you get is 1/hash_fullness; if you want any of the other advantages, you need something like an array.array or numpy.ndarray.
For example:
a = np.zeros(len(self.word_dict), dtype='i2')
for word in split_into_words(d):
    try:
        idx = self.word_dict[word]
    except KeyError:
        # New word: grow the dict and the count array by one slot.
        idx = len(self.word_dict)
        self.word_dict[word] = idx
        a = np.resize(a, idx + 1)  # np.resize returns a new array; reassign it
        a[idx] = 1
    else:
        a[idx] += 1
self.doc_vectors.append(a)
But this still won't be enough. Unless you have on the order of 1K unique words, you can't fit all those counts in memory.
For example, if you have 5000 unique words, you've got 2M arrays, each of which has 5000 2-byte counts, so the most compact possible representation will take 20GB.
Since most documents won't have most words, you will get some benefit by using sparse arrays (or a single 2D sparse array), but there's only so much benefit you can get. And, even if things happened to be ordered in such a way that you get absolutely perfect RLE compression, if the average number of unique words per doc is on the order of 1K, you're still going to run out of memory.
So, you simply can't store all of the document vectors in memory.
If you can process them iteratively instead of all at once, that's the obvious answer.
If not, you'll have to page them in and out to disk (whether explicitly, or by using PyTables or a database or something).
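As a rough sketch of the sparse-array idea mentioned above, built so that only one document's counts are held as a dict at a time (iter_documents and split_into_words are assumed helper functions, not from the question):

from scipy.sparse import coo_matrix

word_dict = {}
rows, cols, data = [], [], []
num_docs = 0
for doc_idx, doc in enumerate(iter_documents(paths)):
    num_docs += 1
    counts = {}
    for word in split_into_words(doc):
        idx = word_dict.setdefault(word, len(word_dict))
        counts[idx] = counts.get(idx, 0) + 1
    for idx, c in counts.items():
        rows.append(doc_idx)
        cols.append(idx)
        data.append(c)
# One 2D sparse matrix of shape (documents x vocabulary); only nonzero counts are stored.
bow = coo_matrix((data, (rows, cols)), shape=(num_docs, len(word_dict)), dtype='i2')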
