Searching values in a large matrix - python

Im working with python 3.5 and Im writing a script that handles large spreadsheet files. Each row of the spreadsheet contains a phrase and several other relevant values. I'm parsing the file as a matrix, but for the example file, it has over 3000 rows (and even larger files should be within expected). I also have a list of 100 words. I need to search for each word, which row of the matrix contains it in its string, and print the some averages based on that.
Currently I'm iterating over each row of the matrix, and then check if the string contains any of the mentioned words, but this process takes 3000 iterations, with 100 checks for each one. Is there any better way to accomplish this task?

In the long run, I would encourage you to use something more suitable for the task. A SQL database, for instance.
But if you stick with writing your own python solution, here are some things you can do to optimize it:
Use sets. Sets have a very efficient membership check.
wordset_100 = set(worldlist_100)
for row in data_3k:
word_matches = wordset_100.intersect(row.phrase.split(" "))
for match in word_matches:
# add to accumulator
# this loop will be run less than len(row.phrase.split(' ')) times
pass
Parallelize.
from multiprocessing import Pool
from collections import defaultdict
def matches(wordset_100, row):
return wordset_100.intersect(row.phrase.split(" ")), row
if __name__ == "__main__":
accu = defaultdict(int)
p = Pool()
wordset_100 = set(worldlist_100)
for m, r in p.map(matches, data_3k):
for word in m:
accu[word] += r.number

Related

Is there a way to force SymSpell Python to return more than one correction recommendation?

I'm using the symspellpy module in Python for query correction. It is really useful and fast, but I'm having a issue with it.
Is there a way to force Symspell to return more than one recommendation for correction. I need it to analyse a better correction based on my application.
I'm calling Symspell like this:
suggestions = sym_spell.lookup(query, VERBOSITY_ALL, max_edit_distance=3)
Example of what I'm trying to do:
query = "resende". The return that I want ["resende", "rezende"]. What the method returns ["resende"]. Note that both "resende" and "rezende" are in my dictionary.
Merely a typo. Change the underscore in
Verbosity_ALL ... to
Verbosity.ALL
The three options are CLOSEST, TOP and ALL
Couple of other things in SymSpell ...
Four algorithm choices
Described here
Supported edit distance algorithm choices.
LEVENSHTEIN = 0 Levenshtein algorithm
DAMERAU_OSA = 1 Damerau optimal string alignment algorithm (default)
LEVENSHTEIN_FAST = 2 Fast Levenshtein algorithm
DAMERAU_OSA_FAST = 3 Fast Damerau optimal string alignment algorithm
DAMERAU_OSA # high count/frequency wins when using .ALL but distances tied?
LEVENSHTEIN # lowest edit distance wins (fewest changes needed)
To change from the default, overwrite it with one of them:
from symspellpy.editdistance import DistanceAlgorithm
sym_spell._distance_algorithm = DistanceAlgorithm.LEVENSHTEIN
Output object details
word = 'something'
matches = sym_spell.lookup(word, Verbosity.ALL, max_edit_distance=2)
for match in matches: # match is ... term, distance, count
print(f'{word} -> {match.term} {match.distance} {match.count}')
Using collections Counter() with SymSpell instead of loading words from file
SymSpell can only read the dictionary of ok words from a file currently (Apr 2022) however this can be added inside symspellpy.py to make it able to read from a collections Counter() output dict or other dictionary of words : counts, a mere quick hack that works for my purposes ...
def load_Counter_dictionary(self, counts_each):
for key, count in counts_each.items():
self.create_dictionary_entry(key, count)
Can then drop the use of load_dictionary(), for something like this instead ...
sym_spell.load_Counter_dictionary( Counter(words_list) )
The reason I resorted to that is a million+ record csv file was already loaded into a pandas dataframe containing a column of codes (think words) with some of them in large numbers (likely correct) along with outliers to be corrected and a column already made containing their counts each. So rather than saving the counts dict to file (expensive) and the reload by SymSpell, this is direct and efficient.

Computing with a large data file

I have a very large (say a few thousand) list of partitions, something like:
[[9,0,0,0,0,0,0,0,0],
[8,1,0,0,0,0,0,0,0],
...,
[1,1,1,1,1,1,1,1,1]]
What I want to do is apply to each of them a function (which outputs a small number of partitions), then put all the outputs in a list and remove duplicates.
I am able to do this, but the problem is that my computer gets very slow if I put the above list directly into the python file (esp. when scrolling). What is making it slow? If it is memory being used to load the whole list,
Is there a way to put the partitions in another file, and have the function just read the list term by term?
EDIT: I am adding some code. My code is probably very inefficient because I'm quite an amateur. So what I really have is a list of lists of partitions, that I want to add to:
listofparts3 = [[[3],[2,1],[1,1,1]],
[[6],[5,1],...,[1,1,1,1,1,1]],...]
def addtolist3(n):
a=int(n/3)-2
counter = 0
added = []
for i in range(len(listofparts3[a])):
first = listofparts3[a][i]
if len(first)<n:
for i in range(n-len(first)):
first.append(0)
answer = lowering1(fock(first),-2)[0]
for j in range(len(answer)):
newelement = True
for k in range(len(added)):
if (answer[j]==added[k]).all():
newelement = False
break
if newelement==True:
added.append(answer[j])
print(counter)
counter = counter+1
for i in range(len(added)):
added[i]=partition(added[i]).tolist()
return(added)
fock, lowering1, partition are all functions in earlier code, they are pretty simple functions. The above function, say addtolist(24), takes all the partition of 21 that I have and returns the desired list of partitions of 24, which I can then append to the end of listofparts3.
A few thousand partitions uses only a modest amount of memory, so that likely isn't the source of your problem.
One way to speed-up function application is to use map() for Python 3 or itertools.imap() from Python 2.
The fastest way to eliminate duplicates is to feed them into a Python set() object.

Python collections Counter with large data

I have two text files both consisting of approximately 700,000 lines.
Second file consists of responses to statements in the first file for corresponding line.
I need to calculate Fisher's Exact Score for each word pair that appears on matching lines.
For example, if nth lines in the files are
how are you
and
fine thanx
then I need to calculate Fisher's score for (how,fine), (how,thanx), (are,fine), (are,thanx), (you,fine), (you,thanx).
In order to calculate Fisher's Exact Score, I used collections module's Counter to count the number of appearances of each word, and their co-appearances throughout the two files, as in
with open("finalsrc.txt") as f1, open("finaltgt.txt") as f2:
for line1, line2 in itertools.izip(f1, f2):
words1 = list(set(list(find_words(line1))))
words2 = list(set(list(find_words(line2))))
counts1.update(words1)
counts2.update(words2)
counts_pair.update(list(set(list(itertools.product(words1, words2)))))
then I calculate the Fisher's exact score for each pair using scipy module by
from scipy import stats
def calculateFisher(s, t):
sa = counts1[s]
ta = counts2[t]
st = counts_pair[s, t]
snt = sa - st
nst = ta - st
nsnt = n - sa - ta + st
oddsratio, pvalue = stats.fisher_exact([[st, snt], [nst, nsnt]])
return pvalue
This works fast and fine for small text files,
but since my files contain 700,000 lines each, I think the Counter gets too large to retrieve the values quickly, and this becomes very very slow.
(Assuming 10 words per each sentence, the counts_pair would have (10^2)*700,000=70,000,000 entries.)
It would take tens of days to finish the computation for all word pairs in the files.
What would be the smart workaround for this?
I would greatly appreciate your help.
How exactly are you calling the calculateFisher function? Your counts_pair will not have 70 million entries: a lot of word pairs will be seen more than once, so seventy million is the sum of their counts, not the number of keys. You should be only calculating the exact test for pairs that do co-occur, and the best place to find those is in counts_pair. But that means that you can just iterate over it; and if you do, you never have to look anything up in counts_pair:
for (s, t), count in counts_pair.iteritems():
sa = counts1[s]
ta = counts2[t]
st = count
# Continue with Fisher's exact calculation
I've factored out the calculate_fisher function for clarity; I hope you get the idea. So if dictionary look-ups were what's slowing you down, this will save you a whole lot of them. If not, ... do some profiling and let us know what's really going on.
But note that simply looking up keys in a huge dictionary shouldn't slow things down too much. However, "retrieving values quickly" will be difficult if your program must to swap most of its data to disk. Do you have enough memory in your computer to hold the three counters simultaneously? Does the first loop complete in a reasonable amount of time? So find the bottleneck and you'll know more about what needs fixing.
Edit: From your comment it sounds like you are calculating Fisher's exact score over and over during a subsequent step of text processing. Why do that? Break up your program in two steps: First, calculate all word pair scores as I describe. Write each pair and score out into a file as you calculate it. When that's done, use a separate script to read them back in (now the memory contains nothing else but this one large dictionary of pairs & Fisher's exact scores), and rewrite away. You should do that anyway: If it takes you ten days just to get the scores (and you *still haven't given us any details on what's slow, and why), get started and in ten days you'll have them forever, to use whenever you wish.
I did a quick experiment, and a python process with a list of a million ((word, word), count) tuples takes just 300MB (on OS X, but the data structures should be about the same size on Windows). If you have 10 million distinct word pairs, you can expect it to take about 2.5 GB of RAM. I doubt you'll have even this many word pairs (but check!). So if you've got 4GB of RAM and you're not doing anything wrong that you haven't told us about, you should be all right. Otherwise, YMMV.
I think that your bottleneck is in how you manipulate the data structures other than the counters.
words1 = list(set(list(find_words(line1)))) creates a list from a set from a list from the result of find_words. Each of these operations requires allocating memory to hold all of your objects, and copying. Worse still, if the type returned by find_words does not include a __len__ method, the resulting list will have to grow and be recopied as it iterates.
I'm assuming that all you need is an iterable of unique words in order to update your counters, for which set will be perfectly sufficient.
for line1, line2 in itertools.izip(f1, f2):
words1 = set(find_words(line1)) # words1 now has list of unique words from line1
words2 = set(find_words(line2)) # words2 now has list of unique words from line2
counts1.update(words1) # counts1 increments words from line1 (once per word)
counts2.update(words2) # counts2 increments words from line2 (once per word)
counts_pair.update(itertools.product(words1, words2)
Note that you don't need to change the output of itertools.product that is passed to counts_pair as there are no repeated elements in words1 or words2, so the Cartesian product will not have any repeated elements.
Sounds like you need to generate the cross-products lazily - a Counter with 70 million elements will take a lot of RAM and suffer from cache misses on virtually every access.
So how about instead save a dict mapping a "file 1" word to a list of sets of corresponding "file 2" words?
Initial:
word_to_sets = collections.defaultdict(list)
Replace:
counts_pair.update(list(set(list(itertools.product(words1, words2)))))
with:
for w1 in words1:
word_to_sets[w1].append(words2)
Then in your Fisher function, replace this:
st = counts_pair[s, t]
with:
st = sum(t in w2set for w2set in word_to_sets.get(s, []))
That's as lazy as I can get - the cross-products are never computed at all ;-)
EDIT Or map a "list 1" word to its own Counter:
Initial:
word_to_counter = collections.defaultdict(collections.Counter)
Replace:
counts_pair.update(list(set(list(itertools.product(words1, words2)))))
with:
for w1 in words1:
word_to_counter[w1].update(words2)
In Fisher function:
st = word_to_counter[s][t]

Python memory error for a large data set

I want to generate a 'bag of words' matrix containing documents with the corresponding counts for the words in the document. In order to do this I run below code for initialising the bag of words matrix. Unfortunately I receive a memory error after x amounts of documents in the line where I read the document. Is there a better way of doing this, so that I can avoid the memory error? Please be aware that I would like to process a very large amount of documents ~ 2.000.000 with only 8 Gb of RAM.
def __init__(self, paths, words_count, normalize_matrix = False ,trainingset_size = None, validation_set_words_list = None):
'''
Open all documents from the given path.
Initialize the variables needed in order
to construct the word matrix.
Parameters
----------
paths: paths to the documents.
words_count: number of words in the bag of words.
trainingset_size: the proportion of the data that should be set to the training set.
validation_set_words_list: the attributes for validation.
'''
print '################ Data Processing Started ################'
self.max_words_matrix = words_count
print '________________ Reading Docs From File System ________________'
timer = time()
for folder in paths:
self.class_names.append(folder.split('/')[len(folder.split('/'))-1])
print '____ dataprocessing for category '+folder
if trainingset_size == None:
docs = os.listdir(folder)
elif not trainingset_size == None and validation_set_words_list == None:
docs = os.listdir(folder)[:int(len(os.listdir(folder))*trainingset_size-1)]
else:
docs = os.listdir(folder)[int(len(os.listdir(folder))*trainingset_size+1):]
count = 1
length = len(docs)
for doc in docs:
if doc.endswith('.txt'):
d = open(folder+'/'+doc).read()
# Append a filtered version of the document to the document list.
self.docs_list.append(self.__filter__(d))
# Append the name of the document to the list containing document names.
self.docs_names.append(doc)
# Increase the class indices counter.
self.class_indices.append(len(self.class_names)-1)
print 'Processed '+str(count)+' of '+str(length)+' in category '+folder
count += 1
What you're asking for isn't possible. Also, Python doesn't automatically get the space benefits you're expecting from BoW. Plus, I think you're doing the key piece wrong in the first place. Let's take those in reverse order.
Whatever you're doing in this line:
self.docs_list.append(self.__filter__(d))
… is likely wrong.
All you want to store for each document is a count vector. In order to get that count vector, you will need to append to a single dict of all words seen. Unless __filter__ is modifying a hidden dict in-place, and returning a vector, it's not doing the right thing.
The main space savings in the BoW model come from not having to store copies of the string keys for each document, and from being able to store a simple array of ints instead of a fancy hash table. But an integer object is nearly as big as a (short) string object, and there's no way to predict or guarantee when you get new integers or strings vs. additional references to existing ones. So, really, the only advantage you get is 1/hash_fullness; if you want any of the other advantages, you need something like an array.array or numpy.ndarray.
For example:
a = np.zeros(len(self.word_dict), dtype='i2')
for word in split_into_words(d):
try:
idx = self.word_dict[word]
except KeyError:
idx = len(self.word_dict)
self.word_dict[word] = idx
np.resize(a, idx+1)
a[idx] = 1
else:
a[idx] += 1
self.doc_vectors.append(a)
But this still won't be enough. Unless you have on the order of 1K unique words, you can't fit all those counts in memory.
For example, if you have 5000 unique words, you've got 2M arrays, each of which has 5000 2-byte counts, so the most compact possible representation will take 20GB.
Since most documents won't have most words, you will get some benefit by using sparse arrays (or a single 2D sparse array), but there's only so much benefit you can get. And, even if things happened to be ordered in such a way that you get absolutely perfect RLE compression, if the average number of unique words per doc is on the order of 1K, you're still going to run out of memory.
So, you simply can't store all of the document vectors in memory.
If you can process them iteratively instead of all at once, that's the obvious answer.
If not, you'll have to page them in and out to disk (whether explicitly, or by using PyTables or a database or something).

How to get two random records with Django

How do I get two distinct random records using Django? I've seen questions about how to get one but I need to get two random records and they must differ.
The order_by('?')[:2] solution suggested by other answers is actually an extraordinarily bad thing to do for tables that have large numbers of rows. It results in an ORDER BY RAND() SQL query. As an example, here's how mysql handles that (the situation is not much different for other databases). Imagine your table has one billion rows:
To accomplish ORDER BY RAND(), it needs a RAND() column to sort on.
To do that, it needs a new table (the existing table has no such column).
To do that, mysql creates a new, temporary table with the new columns and copies the existing ONE BILLION ROWS OF DATA into it.
As it does so, it does as you asked, and runs rand() for every row to fill in that value. Yes, you've instructed mysql to GENERATE ONE BILLION RANDOM NUMBERS. That takes a while. :)
A few hours/days later, when it's done it now has to sort it. Yes, you've instructed mysql to SORT THIS ONE BILLION ROW, WORST-CASE-ORDERED TABLE (worst-case because the sort key is random).
A few days/weeks later, when that's done, it faithfully grabs the two measly rows you actually needed and returns them for you. Nice job. ;)
Note: just for a little extra gravy, be aware that mysql will initially try to create that temp table in RAM. When that's exhausted, it puts everything on hold to copy the whole thing to disk, so you get that extra knife-twist of an I/O bottleneck for nearly the entire process.
Doubters should look at the generated query to confirm that it's ORDER BY RAND() then Google for "order by rand()" (with the quotes).
A much better solution is to trade that one really expensive query for three cheap ones (limit/offset instead of ORDER BY RAND()):
import random
last = MyModel.objects.count() - 1
index1 = random.randint(0, last)
# Here's one simple way to keep even distribution for
# index2 while still gauranteeing not to match index1.
index2 = random.randint(0, last - 1)
if index2 == index1: index2 = last
# This syntax will generate "OFFSET=indexN LIMIT=1" queries
# so each returns a single record with no extraneous data.
MyObj1 = MyModel.objects.all()[index1]
MyObj2 = MyModel.objects.all()[index2]
If you specify the random operator in the ORM I'm pretty sure it will give you two distinct random results won't it?
MyModel.objects.order_by('?')[:2] # 2 random results.
For the future readers.
Get the the list of ids of all records:
my_ids = MyModel.objects.values_list('id', flat=True)
my_ids = list(my_ids)
Then pick n random ids from all of the above ids:
n = 2
rand_ids = random.sample(my_ids, n)
And get records for these ids:
random_records = MyModel.objects.filter(id__in=rand_ids)
Object.objects.order_by('?')[:2]
This would return two random-ordered records. You can add
distinct()
if there are records with the same value in your dataset.
About sampling n random values from a sequence, the random lib could be used,
random.Random().sample(range(0,last),2)
will fetch 2 random samples from among the sequence elements, 0 to last-1
from django.db import models
from random import randint
from django.db.models.aggregates import Count
class ProductManager(models.Manager):
def random(self, count=5):
index = randint(0, self.aggregate(count=Count('id'))['count'] - count)
return self.all()[index:index + count]
You can get different number of objects.
class ModelName(models.Model):
# Define model fields etc
#classmethod
def get_random(cls, n=2):
"""Returns a number of random objects. Pass number when calling"""
import random
n = int(n) # Number of objects to return
last = cls.objects.count() - 1
selection = random.sample(range(0, last), n)
selected_objects = []
for each in selection:
selected_objects.append(cls.objects.all()[each])
return selected_objects

Categories

Resources