I want to generate a 'bag of words' matrix containing documents with the corresponding counts for the words in each document. To do this, I run the code below to initialise the bag of words matrix. Unfortunately, I get a memory error after a certain number of documents, on the line where I read the document. Is there a better way of doing this, so that I can avoid the memory error? Please be aware that I would like to process a very large number of documents (~2,000,000) with only 8 GB of RAM.
def __init__(self, paths, words_count, normalize_matrix = False ,trainingset_size = None, validation_set_words_list = None):
    '''
    Open all documents from the given path.
    Initialize the variables needed in order
    to construct the word matrix.
    Parameters
    ----------
    paths: paths to the documents.
    words_count: number of words in the bag of words.
    trainingset_size: the proportion of the data that should be set to the training set.
    validation_set_words_list: the attributes for validation.
    '''
    print '################ Data Processing Started ################'
    self.max_words_matrix = words_count
    print '________________ Reading Docs From File System ________________'
    timer = time()
    for folder in paths:
        self.class_names.append(folder.split('/')[len(folder.split('/'))-1])
        print '____ dataprocessing for category '+folder
        if trainingset_size == None:
            docs = os.listdir(folder)
        elif not trainingset_size == None and validation_set_words_list == None:
            docs = os.listdir(folder)[:int(len(os.listdir(folder))*trainingset_size-1)]
        else:
            docs = os.listdir(folder)[int(len(os.listdir(folder))*trainingset_size+1):]
        count = 1
        length = len(docs)
        for doc in docs:
            if doc.endswith('.txt'):
                d = open(folder+'/'+doc).read()
                # Append a filtered version of the document to the document list.
                self.docs_list.append(self.__filter__(d))
                # Append the name of the document to the list containing document names.
                self.docs_names.append(doc)
                # Increase the class indices counter.
                self.class_indices.append(len(self.class_names)-1)
            print 'Processed '+str(count)+' of '+str(length)+' in category '+folder
            count += 1
What you're asking for isn't possible. Also, Python doesn't automatically get the space benefits you're expecting from BoW. Plus, I think you're doing the key piece wrong in the first place. Let's take those in reverse order.
Whatever you're doing in this line:
self.docs_list.append(self.__filter__(d))
… is likely wrong.
All you want to store for each document is a count vector. In order to get that count vector, you will need to append to a single dict of all words seen. Unless __filter__ is modifying a hidden dict in-place, and returning a vector, it's not doing the right thing.
The main space savings in the BoW model come from not having to store copies of the string keys for each document, and from being able to store a simple array of ints instead of a fancy hash table. But an integer object is nearly as big as a (short) string object, and there's no way to predict or guarantee when you get new integers or strings vs. additional references to existing ones. So, really, the only advantage you get is 1/hash_fullness; if you want any of the other advantages, you need something like an array.array or numpy.ndarray.
For example:
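a = np.zeros(len(self.word_dict), dtype='i2')
for word in split_into_words(d):
    try:
        idx = self.word_dict[word]
    except KeyError:
        # New word: give it the next index and grow the vector by one zeroed slot.
        # (np.resize returns a new array padded with repeated copies of the old
        # data rather than resizing in place, so np.append is used instead.)
        idx = len(self.word_dict)
        self.word_dict[word] = idx
        a = np.append(a, np.zeros(1, dtype=a.dtype))
    a[idx] += 1
self.doc_vectors.append(a)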
But this still won't be enough. Unless you have on the order of 1K unique words, you can't fit all those counts in memory.
For example, if you have 5000 unique words, you've got 2M arrays, each of which has 5000 2-byte counts, so the most compact possible representation will take 20GB.
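The arithmetic behind that figure can be checked with a few lines. This is only a back-of-the-envelope sketch: dense_matrix_bytes is a hypothetical helper, and it assumes one fixed-size 2-byte count per (document, word) cell with no per-array overhead.

```python
def dense_matrix_bytes(n_docs, n_words, bytes_per_count=2):
    """Bytes needed for a dense n_docs x n_words matrix of fixed-size counts."""
    return n_docs * n_words * bytes_per_count

GB = 10 ** 9
for n_words in (1000, 5000, 50000):
    size = dense_matrix_bytes(2000000, n_words)
    print("%6d unique words -> %.0f GB" % (n_words, size / GB))
```

With 2M documents, even 1000 unique words already costs 4 GB, which is why 8 GB of RAM runs out so quickly.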
Since most documents won't have most words, you will get some benefit by using sparse arrays (or a single 2D sparse array), but there's only so much benefit you can get. And, even if things happened to be ordered in such a way that you get absolutely perfect RLE compression, if the average number of unique words per doc is on the order of 1K, you're still going to run out of memory.
So, you simply can't store all of the document vectors in memory.
If you can process them iteratively instead of all at once, that's the obvious answer.
If not, you'll have to page them in and out to disk (whether explicitly, or by using PyTables or a database or something).
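As a rough illustration of the iterative option, the sketch below (Python 3) yields one sparse count vector per document instead of keeping every document in a list. The name iter_count_vectors and the whitespace tokenizer are assumptions made for the example; they stand in for whatever __filter__ and the real tokenizer do.

```python
from collections import Counter

def iter_count_vectors(documents, vocabulary):
    """Yield one sparse count vector (word index -> count) per document.

    `vocabulary` maps word -> index and is extended in place as new words appear.
    """
    for text in documents:
        counts = Counter()
        for word in text.lower().split():  # stand-in tokenizer (assumption)
            idx = vocabulary.setdefault(word, len(vocabulary))
            counts[idx] += 1
        yield counts

vocab = {}
docs = ["the cat sat", "the cat saw the dog"]  # in practice: read files one at a time
for vector in iter_count_vectors(docs, vocab):
    print(dict(vector))  # consume each vector here, then let it go
```

Only the vocabulary and the current document's counts are alive at any moment, so memory use no longer grows with the number of documents.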
I am trying to create a variable number of variables (arrays) in python.
I have a database from experiments and I am extracting data from it. I do not have control over the database or how data is written. I am extracting data in the form of a table - first (or zeroth column from python's perspective) has location ids and subsequent columns have readings over several iterations. Location ids (in 0th col) span over million of rows, and so the readings of the iterations are captured in subsequent columns. So I read over the database and create this giant table.
In the next step, I loop over columns index 1 to n (0th col has locations) and I am trying to get this - if the difference in 2 readings is more than 0.001, then write the location id to an array.
if ( (A[i][j+1] - A[i][j]) > 0.001): #1<=j<=n, 0<=i<=max rows in the table
then write A[i][0] i.e. location id to an array, arr1[m][n] = A[i][0]
Problem: It is creating dynamic number of variables like arr1. I am storing the result of each loop iteration in an array and the number of column j's are known only during runtime. So how can I create variable number of variables like arr1? Secondly, each of these variables like arr1 can have different size.
I took a look at similar questions, but multi-dimension arrays won't work as each arr1 can have different size. Also, performance is important, so I am guessing numpy arrays would be better. I am guessing that dictionary would be slow in performance for such a huge data.
I didn't understand much from your explanation of the problem, but from what you wrote it sounds like a normal list would do the job:
arr1 = []
if (your condition here):
    arr1.append(A[i][0])
Memory management is dynamic: the list allocates new memory as needed, and if you need a numpy array afterwards, just use numpy_array = np.asarray(arr1).
A (very) small primer on lists in python:
A list in python is a mutable container that stores references to objects of any kind. Unlike C++, in a python list your items can be anything and you don't have to specify the list size when you define it.
In the example above, arr1 is initially defined as empty and every time you call arr1.append() a new reference to A[i][0] is pushed at the end of the list.
For example:
a = []
a.append(1)
a.append('a string')
b = {'dict_key':'my value'}
a.append(b)
print(a)
displays:
[1, 'a string', {'dict_key': 'my value'}]
As you can see, the list doesn't really care what you append; it will store a reference to the item and increase its size by 1.
I strongly suggest you take a look at the data structures documentation for further insight on how lists work and some of their caveats.
I took a look at similar questions, but multi dimension arrays won't work as each arr1 can have different size.
-- but a list of arrays will work, because items in a list can be anything, including arrays of different sizes.
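A minimal sketch of that idea, assuming the table A is a list of rows with the location id in column 0 and readings in the remaining columns (the sample values and the name changed_locations are made up for illustration). One result list is kept per pair of adjacent reading columns, and each can end up a different length:

```python
def changed_locations(table, threshold=0.001):
    """For each pair of adjacent reading columns, collect the location ids
    (column 0) whose reading rose by more than `threshold`."""
    n_readings = len(table[0]) - 1
    results = [[] for _ in range(n_readings - 1)]
    for row in table:
        for j in range(1, n_readings):
            if row[j + 1] - row[j] > threshold:
                results[j - 1].append(row[0])
    return results

A = [
    [101, 0.500, 0.502, 0.502],
    [102, 0.300, 0.300, 0.3005],
    [103, 0.100, 0.200, 0.300],
]
result = changed_locations(A)
print(result)  # [[101, 103], [103]]
```

Each inner list can be converted with np.asarray afterwards if numpy arrays are needed.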
I'm trying to create an algorithm that's capable of showing the top n documents similar to a specific document.
For that I used gensim's doc2vec. The code is below:
model = gensim.models.doc2vec.Doc2Vec(size=400, window=8, min_count=5, workers = 11,
                                      dm=0,alpha = 0.025, min_alpha = 0.025, dbow_words = 1)
model.build_vocab(train_corpus)
for x in xrange(10):
    model.train(train_corpus)
    model.alpha -= 0.002
    model.min_alpha = model.alpha
    model.train(train_corpus)
model.save('model_EN_BigTrain')
sims = model.docvecs.most_similar([408], topn=10)
The sims variable should give me 10 tuples, the first element being the id of the doc and the second the score.
The problem is that some ids do not correspond to any document in my training data.
I've been trying for some time now to make sense of the ids that aren't in my training data, but I don't see any logic.
PS: This is the code that I used to create my train_corpus:
def readData(train_corpus, jData):
    print("The response contains {0} properties".format(len(jData)))
    print("\n")
    for i in xrange(len(jData)):
        print "> Reading offers from Aux array"
        if i % 10 == 0:
            print ">>", i, "offers processed..."
        train_corpus.append(gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(jData[i][1]), tags=[jData[i][0]]))
    print "> Finished processing offers"
Each position of the aux array is itself an array in which position 0 is an int (that I want to be the id) and position 1 is a description.
Thanks in advance.
Are you using plain integer IDs as your tags, but not using exactly all of the integers from 0 to whatever your MAX_DOC_ID is?
If so, that could explain the appearance of tags within that range. When you use plain ints, gensim Doc2Vec avoids creating a dict mapping provided tags to index-positions in its internal vector-array – and just uses the ints themselves.
Thus that internal vector-array must be allocated to include MAX_DOC_ID + 1 rows. Any rows corresponding to unused IDs are still initialized as random vectors, like all the positions, but won't receive any of the training from actual text examples to push them into meaningful relative positions. It's thus possible these random-initialized-but-untrained vectors could appear in later most_similar() results.
To avoid that, either use only contiguous ints from 0 to the last ID you need. Or, if you can afford the memory cost of the string-to-index mapping, use string tags instead of plain ints. Or, keep an extra record of the valid IDs and manually filter the unwanted IDs from results.
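The last option, filtering by a record of valid IDs, might look like the sketch below. It is plain Python and does not call gensim: sims stands for the list of (tag, score) pairs that most_similar() returns (requested with a larger topn than you finally need), and both the sample values and the name filter_valid are made up for illustration.

```python
def filter_valid(sims, valid_ids, topn=10):
    """Keep only (tag, score) pairs whose tag was actually used in training."""
    return [(tag, score) for tag, score in sims if tag in valid_ids][:topn]

valid_ids = {0, 3, 408, 512}  # the IDs that really appear in the training data
sims = [(512, 0.91), (7, 0.88), (3, 0.80), (100, 0.79), (0, 0.42)]  # made-up scores
print(filter_valid(sims, valid_ids, topn=2))  # [(512, 0.91), (3, 0.8)]
```

Keeping valid_ids as a set makes each membership check cheap, so the filter adds little cost even for long result lists.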
Separately: by not specifying iter=1 in your Doc2Vec model initialization, the default of iter=5 will be in effect, meaning each call to train() does 5 iterations over your data. Oddly, also, your xrange(10) for-loop includes two separate calls to train() each iteration (and the 1st is just using whatever alpha/min_alpha was already in place). So you're actually doing 10 * 2 * 5 = 100 passes over the data, with an odd learning-rate schedule.
I suggest instead if you want 10 passes to just set iter=10, leave default alpha/min_alpha untouched, and then call train() only once. The model will do 10 passes, smoothly managing alpha from its starting to ending values.
I was having this problem as well, I was initializing my doc2vec with the following:
for idx,doc in data.iterrows():
    alldocs.append(TruthDocument(doc['clean_text'], [idx], doc['label']))
I was passing it a dataframe that had some wonky indexes. All I had to do was:
df.reset_index(inplace=True)
I am trying to form an optimized approach to splitting a list of file names(examples shortly) in a x:y ratio based on the file names. This file list was procured using os.scandir (better performance vs os.listdir, src: Python Docs scandir).
Example -
Files (extension disregarded)-
A_1,A_2,...A_10 (here A is filename and 1 is the sample number of the file)
B_1,B_2,...B_10
and so on
Let's say the x:y ratio is 7:3
So I would like 70% of the file names (A_1..A_7, B_1..B_7) and 30% (A_8..A_10, B_8..B_10) in different lists. The order within the first list does not matter; the files could be A_1, A_9, A_5, etc., as long as they are split 7 files in list 1 to 3 files in list 2.
Now it must be noted that this directory is huge (~150k files) and the samples of each type of files vary, i.e. it maybe that files with filename A have 1000 files or it may have only 5. Also there are about 400 unique filenames.
This current solution should not be called a solution at all as it defies the purpose of an accurate ratio for each filename. It is currently splitting the list of fileObjects(basically- name like A, number like 1, data within file A_1 and so on) as a whole in x:y ratio and taking advantage of the fact that entries are yielded in arbitrary order when using os.scandir.
ratio_number = int(len(list_of_fileObjects) *.7)
list_70 = list_of_fileObjects[:ratio_number]
list_30 = list_of_fileObjects[ratio_number:]
My second approach which would at least be a valid solution was to create a list separately for each filename(involves sorting the whole list of files), split it in the ratio and do this for each filename. I am looking for a more pythonic/elegant solution to this problem. Any suggestions or help would be appreciated especially considering the size of data being dealt with.
If I understand the situation correctly, you're trying to partition the same proportion of each filename prefix's files. Your current method selects the correct proportion from the whole set of files, but it doesn't consider the different filename prefixes, so it may not get them in the correct proportion (though it will probably be somewhat close, most of the time).
Your second approach avoids that issue by first separating the filenames by prefix, then partitioning each sublist. But if you want a combined list with all the prefixes together, this approach may end up wasting time copying data around, since you have to separate out and then recombine the separate lists by prefix.
I think you can do what you want with a single loop over the filenames. You'll need to keep track of two data points for each filename prefix: The number of files with that prefix you've selected for the first sample and the total number of files with that prefix that you've seen.
ratio = 0.7
prefix_dict = {} # values are lists: [number_selected_for_first_list, total_number_seen]
first_sample = [] # gets a proportion of the files equal to ratio (for each prefix)
second_sample = [] # gets the rest of the files
for filename in list_of_files:
    prefix = filename.split("_", 1)[0]
    selected_seen = prefix_dict.setdefault(prefix, [0, 0])
    selected_seen[1] += 1
    if selected_seen[0] < round(ratio * selected_seen[1]):
        first_sample.append(filename)
        selected_seen[0] += 1
    else:
        second_sample.append(filename)
The only tricky part of this code is the use of dict.setdefault to fetch the selected_seen list. If the requested prefix doesn't yet exist in the dictionary, a new value ([0, 0]) will be added to the dictionary under that key (and returned). The later code modifies the list in place.
Depending on how exactly you want to handle inexact proportions, you can change the if condition a bit. I put in a round call (which I think will partition most accurately), but the code would work OK without it (biasing the selection towards the second sample) or with selected_seen[0] <= int(ratio * selected_seen[1]) (biasing towards the first sample).
Note that whichever way you choose to round when partitioning each prefix, there's the possibility that the separate prefixes will all end up unbalanced in the same direction, making the overall samples unbalanced by more than you'd normally expect. For instance, if you had ten prefixes with ten files each (for 100 files total), a ratio of 0.75 would result in final sample lists of 80 and 20 files rather than 75 and 25. That happens because each of the prefixes gets partitioned 8 and 2 (7.5 rounds up). If every file had a unique prefix, you'd end up with everything in the first sample! If it's very important that the overall samples be the right sizes, you might need to fudge the sampling of the items a bit, based on the overall sample sizes.
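To see the imbalance concretely, here is the same loop wrapped in a function and run on the ten-prefix case. It is a sketch for Python 3 (whose round() rounds halves to the nearest even number); split_by_prefix and the generated file names are made up for the demonstration.

```python
def split_by_prefix(filenames, ratio):
    """Split filenames so roughly `ratio` of each prefix lands in the first list."""
    seen = {}  # prefix -> [number selected for first list, total seen]
    first, second = [], []
    for name in filenames:
        prefix = name.split("_", 1)[0]
        selected_seen = seen.setdefault(prefix, [0, 0])
        selected_seen[1] += 1
        if selected_seen[0] < round(ratio * selected_seen[1]):
            first.append(name)
            selected_seen[0] += 1
        else:
            second.append(name)
    return first, second

# Ten prefixes (A..J) with ten files each: A_1..A_10, B_1..B_10, ...
names = ["%s_%d" % (p, i) for p in "ABCDEFGHIJ" for i in range(1, 11)]
first, second = split_by_prefix(names, 0.75)
print(len(first), len(second))  # 80 20
```

Every prefix is split 8 and 2, so the per-prefix rounding adds up to an overall 80:20 split instead of 75:25.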
I figured out a good solution to this problem.
all_file_names = {}
# ObjList is a list of objects but we only need
# file_name from that object for our solution
for x in ObjList:
    if x.file_name not in all_file_names:
        all_file_names[x.file_name] = 1
    else:
        all_file_names[x.file_name] += 1
trainingData = []
testData = []
temp_dict = {}
for x in ObjList:
    ratio = int(0.7*all_file_names[x.file_name])+1
    if x.file_name not in temp_dict:
        temp_dict[x.file_name] = 1
        trainingData.append(x)
    else:
        temp_dict[x.file_name] += 1
        if(temp_dict[x.file_name] < ratio):
            trainingData.append(x)
        else:
            testData.append(x)
I have 4 parallel arrays based on a table representing attributes of a map. Each array has approx. 500 values, but all have the same number of values.
The arrays are:
start = location of the endpoint with the smaller flow accumulation,
end = location of the other endpoint (with the larger flow accumulation),
length = segment length, and;
shape = actual shape, oriented to run from start to end.
I am attempting to create a data structure on which I can use a recursive function to determine the start and end points every 2000 m along the length.
The following question and answer describe what I am attempting to accomplish:
https://gis.stackexchange.com/questions/87649/select-points-approx-2000-metres-from-another-point-along-a-river
How do I store these 4 parallel arrays in a dictionary keyed by start?
I am new to writing functions, dictionaries and using arrays in dictionaries. I am attempting to do this task in Python.
I think this is what you mean:
d = {}
for i in range(len(start)):
    d[start[i]] = (shape[i],length[i],end[i])
so now d[some_start_value] will hold the corresponding shape length and end values.
If you want to do things a little bit more Python-esque, you can use enumerate:
d = {}
for (i,st) in enumerate(start):
    d[st] = (shape[i],length[i],end[i])
or even better - zip:
d = {}
for (st,sh,le,en) in zip(start,shape,length,end):
    d[st] = (sh,le,en)
Note that you can leave out the parentheses around the first part of the for loops (i.e. between the for and in keywords). I used them solely for enhanced code readability.
As with WeaselFox's answer, d[some_start_value] will now hold the corresponding shape, length and end values.
In addition to the above answers, I would recommend using namedtuple to simplify accesses:
from collections import namedtuple
# This creates a namedtuple called GISData. Name of the object and name in the first argument
# should be the same.
GISData = namedtuple('GISData', 'start shape length end')
# zip creates 1 list of 4-tuples from 4 single lists
# There are other ways to write this; this is just the shortest for me.
# Note that if you need this ordered, you should use an OrderedDict,
# which is in the collections module in python 2.7+, or you can find
# backported versions for python 2.6+. In those, the keys preserve ordering,
# so can still be searched as a list, which is useful if you need to find e.g.
# 479, which is not in the dictionary, but 400 and 500 are and you have to interpolate etc.
GISDict = dict((x[0], GISData(*x)) for x in zip(start, shape, length, end))
# The dictionary for any given start value
# Access the 4 individual pieces by name, or by index
GISDict[start_lookup].shape
etc.
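For a self-contained illustration of the namedtuple approach, here is a small Python 3 sketch. The four sample arrays hold made-up values, and gis_by_start is just an illustrative name for the resulting dictionary.

```python
from collections import namedtuple

GISData = namedtuple('GISData', 'start shape length end')

# Made-up parallel arrays standing in for the real table columns.
start = [10, 20, 30]
end = [20, 30, 40]
length = [1500.0, 800.0, 2300.0]
shape = ['shape_a', 'shape_b', 'shape_c']

# map() walks the four lists in parallel and builds one record per segment.
gis_by_start = {rec.start: rec for rec in map(GISData, start, shape, length, end)}

segment = gis_by_start[20]
print(segment.length, segment.end)  # 800.0 30
```

Because each record also stores its end value, a function can follow a river downstream by repeatedly looking up gis_by_start[segment.end] until the key is missing.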
I have two text files both consisting of approximately 700,000 lines.
Second file consists of responses to statements in the first file for corresponding line.
I need to calculate Fisher's Exact Score for each word pair that appears on matching lines.
For example, if nth lines in the files are
how are you
and
fine thanx
then I need to calculate Fisher's score for (how,fine), (how,thanx), (are,fine), (are,thanx), (you,fine), (you,thanx).
In order to calculate Fisher's Exact Score, I used collections module's Counter to count the number of appearances of each word, and their co-appearances throughout the two files, as in
with open("finalsrc.txt") as f1, open("finaltgt.txt") as f2:
    for line1, line2 in itertools.izip(f1, f2):
        words1 = list(set(list(find_words(line1))))
        words2 = list(set(list(find_words(line2))))
        counts1.update(words1)
        counts2.update(words2)
        counts_pair.update(list(set(list(itertools.product(words1, words2)))))
then I calculate the Fisher's exact score for each pair using scipy module by
from scipy import stats
def calculateFisher(s, t):
    sa = counts1[s]
    ta = counts2[t]
    st = counts_pair[s, t]
    snt = sa - st
    nst = ta - st
    nsnt = n - sa - ta + st
    oddsratio, pvalue = stats.fisher_exact([[st, snt], [nst, nsnt]])
    return pvalue
This works fast and fine for small text files,
but since my files contain 700,000 lines each, I think the Counter gets too large to retrieve the values quickly, and this becomes very very slow.
(Assuming 10 words per each sentence, the counts_pair would have (10^2)*700,000=70,000,000 entries.)
It would take tens of days to finish the computation for all word pairs in the files.
What would be the smart workaround for this?
I would greatly appreciate your help.
How exactly are you calling the calculateFisher function? Your counts_pair will not have 70 million entries: a lot of word pairs will be seen more than once, so seventy million is the sum of their counts, not the number of keys. You should be only calculating the exact test for pairs that do co-occur, and the best place to find those is in counts_pair. But that means that you can just iterate over it; and if you do, you never have to look anything up in counts_pair:
for (s, t), count in counts_pair.iteritems():
    sa = counts1[s]
    ta = counts2[t]
    st = count
    # Continue with Fisher's exact calculation
I've inlined the calculateFisher function for clarity; I hope you get the idea. So if dictionary look-ups were what's slowing you down, this will save you a whole lot of them. If not, ... do some profiling and let us know what's really going on.
But note that simply looking up keys in a huge dictionary shouldn't slow things down too much. However, "retrieving values quickly" will be difficult if your program must swap most of its data to disk. Do you have enough memory in your computer to hold the three counters simultaneously? Does the first loop complete in a reasonable amount of time? So find the bottleneck and you'll know more about what needs fixing.
Edit: From your comment it sounds like you are calculating Fisher's exact score over and over during a subsequent step of text processing. Why do that? Break up your program into two steps: First, calculate all word pair scores as I describe. Write each pair and score out into a file as you calculate it. When that's done, use a separate script to read them back in (now the memory contains nothing else but this one large dictionary of pairs & Fisher's exact scores), and rewrite away. You should do that anyway: If it takes you ten days just to get the scores (and you still haven't given us any details on what's slow, and why), get started and in ten days you'll have them forever, to use whenever you wish.
I did a quick experiment, and a python process with a list of a million ((word, word), count) tuples takes just 300MB (on OS X, but the data structures should be about the same size on Windows). If you have 10 million distinct word pairs, you can expect it to take about 2.5 GB of RAM. I doubt you'll have even this many word pairs (but check!). So if you've got 4GB of RAM and you're not doing anything wrong that you haven't told us about, you should be all right. Otherwise, YMMV.
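The iterate-over-counts_pair idea from the start of this answer can be sketched end to end on toy data. This is Python 3 and uses only the standard library: the two sample line lists and the generator name contingency_tables are made up, and whitespace splitting stands in for find_words. Each yielded table has the same layout that the question passes to stats.fisher_exact.

```python
from collections import Counter
from itertools import product

src_lines = ["how are you", "how is it going"]  # toy stand-in for finalsrc.txt
tgt_lines = ["fine thanx", "fine"]              # toy stand-in for finaltgt.txt

counts1, counts2, counts_pair = Counter(), Counter(), Counter()
for line1, line2 in zip(src_lines, tgt_lines):
    words1, words2 = set(line1.split()), set(line2.split())
    counts1.update(words1)
    counts2.update(words2)
    counts_pair.update(product(words1, words2))
n = len(src_lines)  # number of line pairs

def contingency_tables(counts1, counts2, counts_pair, n):
    """Yield ((s, t), 2x2 table) for every pair that co-occurs at least once."""
    for (s, t), st in counts_pair.items():
        sa, ta = counts1[s], counts2[t]
        yield (s, t), [[st, sa - st], [ta - st, n - sa - ta + st]]

tables = dict(contingency_tables(counts1, counts2, counts_pair, n))
print(tables[("how", "fine")])  # [[2, 0], [0, 0]]
```

In a real run you would score and write out each table as it is yielded, rather than collecting them all in a dict as this demo does.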
I think that your bottleneck is in how you manipulate the data structures other than the counters.
words1 = list(set(list(find_words(line1)))) creates a list from a set from a list from the result of find_words. Each of these operations requires allocating memory to hold all of your objects, and copying. Worse still, if the type returned by find_words does not include a __len__ method, the resulting list will have to grow and be recopied as it iterates.
I'm assuming that all you need is an iterable of unique words in order to update your counters, for which set will be perfectly sufficient.
for line1, line2 in itertools.izip(f1, f2):
    words1 = set(find_words(line1)) # words1 is now the set of unique words from line1
    words2 = set(find_words(line2)) # words2 is now the set of unique words from line2
    counts1.update(words1) # counts1 increments words from line1 (once per word)
    counts2.update(words2) # counts2 increments words from line2 (once per word)
    counts_pair.update(itertools.product(words1, words2))
Note that you don't need to change the output of itertools.product that is passed to counts_pair as there are no repeated elements in words1 or words2, so the Cartesian product will not have any repeated elements.
Sounds like you need to generate the cross-products lazily - a Counter with 70 million elements will take a lot of RAM and suffer from cache misses on virtually every access.
So how about instead save a dict mapping a "file 1" word to a list of sets of corresponding "file 2" words?
Initial:
word_to_sets = collections.defaultdict(list)
Replace:
counts_pair.update(list(set(list(itertools.product(words1, words2)))))
with:
for w1 in words1:
    word_to_sets[w1].append(words2)
Then in your Fisher function, replace this:
st = counts_pair[s, t]
with:
st = sum(t in w2set for w2set in word_to_sets.get(s, []))
That's as lazy as I can get - the cross-products are never computed at all ;-)
EDIT: Or map a "file 1" word to its own Counter:
Initial:
word_to_counter = collections.defaultdict(collections.Counter)
Replace:
counts_pair.update(list(set(list(itertools.product(words1, words2)))))
with:
for w1 in words1:
    word_to_counter[w1].update(words2)
In Fisher function:
st = word_to_counter[s][t]
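Put together, the per-word Counter variant might look like this on toy data. It is a Python 3 sketch: the sample lines and the helper co_count are made up, and whitespace splitting stands in for find_words.

```python
from collections import Counter, defaultdict

src_lines = ["how are you", "how is it going"]  # toy "file 1"
tgt_lines = ["fine thanx", "fine"]              # toy "file 2"

word_to_counter = defaultdict(Counter)
for line1, line2 in zip(src_lines, tgt_lines):
    words1, words2 = set(line1.split()), set(line2.split())
    for w1 in words1:
        word_to_counter[w1].update(words2)

def co_count(s, t):
    """Number of line pairs with s in the first line and t in the reply."""
    # .get avoids creating empty Counters for source words never seen.
    return word_to_counter.get(s, {}).get(t, 0)

print(co_count("how", "fine"), co_count("are", "thanx"))  # 2 1
```

Using .get in the lookup matters with a defaultdict: indexing word_to_counter[s] directly would silently add an empty Counter for every unseen word queried.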