How to count frequency in gensim.Doc2Vec? - python

I am training a model with gensim. My corpus consists of many short sentences, and each sentence has a frequency which indicates how many times it occurs in the total corpus. I implement it as follows; as you can see, I simply repeat each sentence freq times. This works when the data is small, but as the data grows the frequencies can be very large, and it costs more memory than my machine can afford.
So:
1. Can I just count the frequency of every record instead of repeating it freq times?
2. Are there any other ways to save memory?
import datetime
import gensim
from gensim.models.doc2vec import TaggedDocument

class AddressSentences(object):
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as fi:
            headers = next(fi).strip().split(",")
            i_address, i_freq = headers.index("address"), headers.index("freq")
            index = 0
            for line in fi:
                cols = line.strip().split(",")
                freq = cols[i_freq]
                address = cols[i_address].split()
                # Here I do repeat
                for i in range(int(freq)):
                    yield TaggedDocument(address, [index])
                    index += 1
print("START %s" % datetime.datetime.now())
train_corpus = list(AddressSentences("/data/corpus.csv"))
model = gensim.models.doc2vec.Doc2Vec(size=50, min_count=2, iter=55)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)
print("END %s" % datetime.datetime.now())
corpus is something like this:
address,freq
Cecilia Chapman 711-2880 Nulla St.,1000
The Business Centre,1000
61 Wellfield Road,500
Celeste Slater 606-3727 Ullamcorper. Street,600
Theodore Lowe Azusa New York 39531,700
Kyla Olsen Ap #651-8679 Sodales Av.,300

Two options for your exact question:
(1)
You don't need to reify your corpus iterator into a fully in-memory list, with your line:
train_corpus = list(AddressSentences("/data/corpus.csv"))
The gensim Doc2Vec model can use your iterable object directly as its corpus, since it implements __iter__() (and thus can be iterated over multiple times). So you can just do:
train_corpus = AddressSentences("/data/corpus.csv")
Then each line will be read, and each repeated TaggedDocument re-yielded, on every pass, without requiring the full set in memory.
(2)
Alternatively, in such cases you may sometimes just want to write a separate routine that takes your original file and, rather than directly yielding TaggedDocuments, does the repetition to create a tangible file on disk which includes the repetitions. Then, use a simpler iterable reader to stream that (already-repeated) dataset into your model.
A negative of this approach, in this particular case, is that it would increase the amount of (likely relatively laggy) disk-IO. However, if the special processing your iterator is doing is more costly – such as regex-based tokenization – this sort of process-and-rewrite can help avoid duplicate work by the model later. (The model needs to scan your corpus once for vocabulary-discovery, then again iter times for training – so any time-consuming work in your iterator will be done redundantly, and may be the bottleneck that keeps other training threads idle waiting for data.)
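A minimal sketch of that pre-repetition step, assuming the same CSV layout as above (the output path and the choice of one space-delimited address per line are illustrative, not prescribed by the question):

import csv

# Expand the CSV into a plain-text file where each address is written
# `freq` times, one space-delimited address per line.
with open("/data/corpus.csv") as src, open("/data/corpus_repeated.txt", "w") as dst:
    reader = csv.DictReader(src)
    for row in reader:
        line = " ".join(row["address"].split()) + "\n"
        for _ in range(int(row["freq"])):
            dst.write(line)

A simple line-by-line reader can then stream that file, wrapping each line's tokens in a TaggedDocument.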
But after those two options, some Doc2Vec-specific warnings:
Repeating documents like this may not benefit the Doc2Vec model, as compared to simply iterating over the full diverse set. It's the tug-of-war interplay of contrasting examples which causes the word-vectors/doc-vectors in Word2Vec/Doc2Vec models to find useful relative arrangements.
Repeating exact documents/word-contexts is a plausible way to "overweight" those examples, but even if that's really what you want, and would help your end-goals, it'd be better to shuffle those repeats through the whole set.
Repeating one example consecutively is like applying the word-cooccurrences of that example like a jackhammer on the internal neural-network, without any chance for interleaved alternate examples to find a mutually-predictive weight arrangement. The iterative gradient-descent optimization through all diverse examples ideally works more like gradual water-driven erosion & re-deposition of values.
That suggests another possible reason to take the second approach, above: after writing the file-with-repeats, you could use an external line-shuffling tool (like sort -R or shuf on Linux) to shuffle the file. Then, the 1000 repeated lines of some examples would be evenly spread among all the other (repeated) examples, a friendlier arrangement for dense-vector learning.
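If you'd rather stay in Python than shell out to sort -R or shuf, a minimal shuffle sketch (assuming the repeated file fits in memory as raw text lines; the file names are illustrative):

import random

# Shuffle the already-repeated corpus file so repeats are spread evenly.
with open("/data/corpus_repeated.txt") as f:
    lines = f.readlines()
random.shuffle(lines)
with open("/data/corpus_repeated_shuffled.txt", "w") as f:
    f.writelines(lines)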
In any case, I would try leaving out repetition entirely, or shuffling repetitions, and evaluate which steps are really helping on whatever the true end goal is.

Related

Gensim Word2Vec exhausting iterable

I'm getting the following log output when calling model.train() from gensim Word2Vec:
INFO : EPOCH 0: training on 0 raw words (0 effective words) took 0.0s, 0 effective words/s
The only solutions I found while searching for an answer point to the iterable vs. iterator difference, and at this point I have tried everything I could to solve this on my own. Currently, my code looks like this:
import re
from gensim import utils
from gensim.models import Word2Vec

class MyCorpus:
    def __init__(self, corpus):
        self.corpus = corpus.copy()

    def __iter__(self):
        for line in self.corpus:
            x = re.sub("(<br ?/?>)|([,.'])|([^ A-Za-z']+)", '', line.lower())
            yield utils.simple_preprocess(x)

sentences = MyCorpus(corpus)
w2v_model = Word2Vec(
    sentences=sentences,
    vector_size=w2v_size,
    window=w2v_window,
    min_count=w2v_min_freq,
    workers=-1
)
The corpus variable is a list containing sentences, and each sentence is a string.
I tried numerous "tests" to see if my class is indeed iterable, like running the following repeatedly:
print(sum(1 for _ in sentences))
print(sum(1 for _ in sentences))
print(sum(1 for _ in sentences))
All of them suggest that my class is iterable, so at this point I think the problem must be something else.
workers=-1 is not a supported value for Gensim's Word2Vec model; it essentially means you're using no threads.
Instead, you must specify the actual number of worker threads you'd like to use.
When using an iterable corpus, the optimal number of workers is usually some number up to your number of CPU cores, but not higher than 8-12 if you've got 16+ cores, because of some hard-to-remove inefficiencies in both Python's Global Interpreter Lock ("GIL") and Gensim's master-reader-thread approach.
Generally, also, you'll get better throughput if your iterable isn't doing anything expensive or repetitive in its preprocessing, like regex-based tokenization, or a tokenization that's repeated on every epoch. So it's best to do such preprocessing once, writing the resulting simple space-delimited tokens to a new file. Then, read that file with a very simple, no-regex, space-splitting-only tokenization.
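For example, a sketch of doing the regex cleanup once, reusing the question's corpus list and regex (the output filename is illustrative):

import re
from gensim import utils

# Preprocess once: write space-delimited tokens, one text per line.
with open("corpus_tokens.txt", "w", encoding="utf8") as out:
    for line in corpus:
        x = re.sub("(<br ?/?>)|([,.'])|([^ A-Za-z']+)", '', line.lower())
        out.write(" ".join(utils.simple_preprocess(x)) + "\n")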
(If performance becomes a major concern on a large dataset, you can also look into the alternate corpus_file method of specifying your corpus. It expects a single file, where each text is on its own line, and tokens are already just space-delimited. But it then lets every worker thread read its own range of the file, with far less GIL/reader-thread bottlenecking, so using workers equal to the CPU core count is then roughly optimal for throughput.)
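A hedged sketch of that corpus_file mode, assuming gensim 4.x and the pre-tokenized file written above (one space-delimited text per line):

import multiprocessing
from gensim.models import Word2Vec

w2v_model = Word2Vec(
    corpus_file="corpus_tokens.txt",
    vector_size=w2v_size,
    window=w2v_window,
    min_count=w2v_min_freq,
    workers=multiprocessing.cpu_count(),  # an explicit positive count, never -1
)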

How to detect the amount of almost-repetition in a text file?

I am a programming teacher, and I would like to write a script that detects the amount of repetition in a C/C++/Python file. I guess I can treat any file as pure text.
The script's output would be the number of similar sequences that repeat. Ultimately, I am only interested in a DRY metric (how well the code satisfies the DRY principle).
Naively I tried to do a simple autocorrelation but it would be hard to find the proper threshold.
import numpy as np
import matplotlib.pyplot as plt

u = open("find.c").read()
v = [ord(x) for x in u]
y = np.correlate(v, v, mode="same")
y = y[: int(len(y) / 2)]
x = range(len(y))
z = np.polyval(np.polyfit(x, y, 3), x)
f = (y - z)[:-5]
plt.plot(f)
plt.show()
So I am looking at different strategies... I also tried to compare the similarities between each line, each group of 2 lines, each group of 3 lines ...
import difflib
import numpy as np

lines = open("b.txt").readlines()
lines = [line.strip() for line in lines]

n = 3
d = []
for i in range(len(lines)):
    a = lines[i:i+n]
    for j in range(len(lines)):
        b = lines[j:j+n]
        if i == j: continue  # skip same line
        group_size = np.sum([len(x) for x in a])
        if group_size < 5: continue  # skip short lines
        ratio = 0
        for u, v in zip(a, b):
            r = difflib.SequenceMatcher(None, u, v).ratio()
            ratio += r if r > 0.7 else 0
        d.append(ratio)

dry = sum(d) / len(lines)
In the following, we can identify some repetition at a glance:
w = int(len(d) / 100)
e = np.convolve(d, np.ones(w), "valid") / w * 10
plt.plot(range(len(d)), d, range(len(e)), e)
plt.show()
Why not use:
d = np.exp(np.array(d))
Thus, the difflib module looks promising: SequenceMatcher does some magic (Levenshtein?), but I would need some magic constants as well (0.7)... However, this code is > O(n^2) and runs very slowly for long files.
What is funny is that the amount of repetition is quite easily identified with attentive eyes (sorry to this student for having used his code as a good bad example).
I am sure there is a more clever solution out there.
Any hint?
I would build a system based on compressibility, because that is essentially what things being repeated means. Modern compression algorithms are already looking for how to reduce repetition, so let's piggy back on that work.
Things that are similar will compress well under any reasonable compression algorithm, e.g. LZ. Under the hood, the compressed output is essentially the text plus references back to itself, which you might be able to pull out.
Write a program that feeds lines [0:n] into the compression algorithm and compares the output length with that of [0:n+1].
When the incremental length of the compressed output increases by a lot less than the incremental input, you note down that you potentially have a DRY candidate at that location; plus, if you can figure out the format, you can see what previous text it was deemed similar to.
If you can figure out the compression format, you don't need to rely on the "size doesn't grow as much" heuristic, you can just pull out the references directly.
If needed, you can find similar structures with different names by pre-processing the input, for instance by normalizing the names. However I foresee this getting a bit messy, so it's a v2 feature. Pre-processing can also be used to normalize the formatting.
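A rough sketch of that heuristic using zlib (the 0.3 growth threshold and the filename are illustrative guesses, not tuned values, and recompressing every prefix makes this O(n^2) overall):

import zlib

lines = open("find.c").read().splitlines(keepends=True)
prev_size = len(zlib.compress(b""))
for n, line in enumerate(lines):
    prefix = "".join(lines[:n + 1]).encode()
    size = len(zlib.compress(prefix, 9))
    growth = size - prev_size
    # A line whose addition barely grows the output is likely a near-repeat.
    if len(line.strip()) > 5 and growth < 0.3 * len(line.encode()):
        print(f"line {n + 1}: possible repeat ({growth} bytes added for {len(line)} chars)")
    prev_size = size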
Looks like you're choosing a long path. I wouldn't go there.
I would look into trying to minify the code before analyzing it. To completely remove any influence of variable names, extra spacing, formatting and even slight logic reshuffling.
Another approach would be comparing byte-code of the students. But it may be not a very good idea since the result will likely have to be additionally cleaned up.
dis (Python's bytecode disassembler) would be an interesting option.
I would, most likely, stop at comparing their ASTs. But the AST is likely to give false positives for short functions, because their structure may be too similar, so consider checking short functions with something else, something trivial.
On top of that, I would consider using Levenshtein distance or something similar to numerically calculate the differences between the byte-codes/sources/ASTs/dis output of the students. This would be what? Almost O(N^2)? Shouldn't matter.
Or, if needed, make it more complex and calculate the distance between each function of student A and each function of student B, highlighting cases when the distance is too short. It may be not needed though.
With simplification and normalization of the input, more algorithms should start returning good results. If a student is good enough to take someone's code and reshuffle not only the variables, but the logic and maybe even improve the algo, then this student understands the code well enough to defend it and use it with no help in future. I guess, that's the kind of help a teacher would want to be exchanged between students.
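A small sketch of the AST-comparison idea, assuming two Python submissions on disk (the filenames are illustrative; only variable and argument names are normalized here, so function names, attributes, and constants still count toward the similarity):

import ast
import difflib

class NormalizeNames(ast.NodeTransformer):
    # Replace every variable and argument name with a placeholder so that
    # renamed-but-identical code still compares as similar.
    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_x", ctx=node.ctx), node)
    def visit_arg(self, node):
        node.arg = "_a"
        return node

def normalized_dump(path):
    tree = ast.parse(open(path).read())
    tree = NormalizeNames().visit(tree)
    return ast.dump(tree)

ratio = difflib.SequenceMatcher(None,
                                normalized_dump("student_a.py"),
                                normalized_dump("student_b.py")).ratio()
print(f"AST similarity: {ratio:.2f}")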
You can treat this as a variant of the longest common subsequence problem between the input and itself, where the trivial matching of each element with itself is disallowed. This retains the optimal substructure of the standard algorithm, since it can be phrased as a non-transitive “equality” and the algorithm never relies on transitivity.
As such, we can write this trivial implementation:
import operator

class Repeat:
    def __init__(self, l):
        self.l = list(l)
        self.memo = {}

    def __call__(self, m, n):
        l = self.l
        memo = self.memo
        k = m, n
        ret = memo.get(k)
        if not ret:
            if not m or not n:
                ret = 0, None
            elif m != n and l[m-1] == l[n-1]:  # critical change here!
                z, tail = self(m-1, n-1)
                ret = z+1, ((m-1, n-1), tail)
            else:
                ret = max(self(m-1, n), self(m, n-1),
                          key=operator.itemgetter(0))
            memo[k] = ret
        return ret

    def go(self):
        n = len(self.l)
        v = self(n, n)[1]
        ret = []
        while v:
            x, v = v
            ret.append(x)
        ret.reverse()
        return ret

def repeat(l):
    return Repeat(l).go()
You might want to canonicalize lines of code by removing whitespace (except perhaps between letters), removing comments, and/or replacing each unique identifier with a standardized label. You might also want to omit trivial lines like } in C/C++ to reduce noise. Finally, the symmetry should allow only cases with, say, m>=n to be treated.
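A possible canonicalization pass before feeding lines to repeat(), assuming C-like input (the exact rules here are illustrative and would need tuning):

import re

def canonicalize(lines):
    out = []
    for line in lines:
        line = re.sub(r"//.*", "", line)           # drop line comments
        line = re.sub(r"\s+", " ", line).strip()   # collapse whitespace
        if line and line not in ("{", "}", "};"):  # skip trivial brace-only lines
            out.append(line)
    return out

# e.g. pairs = repeat(canonicalize(open("find.c").readlines()))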
Of course, there are also "real" answers and real research on this issue!
Frame challenge: I’m not sure you should do this
It'd be a fun programming challenge for yourself, but if you intend to use it as a teaching tool, I'm not sure I would. There's not a good definition of "repeat" from the DRY principle that would be easy to test for fully in a computer program. The human definition, which I'd say is basically "failure to properly abstract your code at an appropriate level, manifested via some type of repetition of code, whether repeating exact blocks, repeating the same idea over and over again, or somewhere in between", isn't something I think anyone will be able to get working well enough at this time to use as a tool that teaches good habits with respect to DRY without confusing the student or teaching bad habits too. For now I'd argue this is a job for humans, because it's easy for us and hard for computers, at least for now...
That said if you want to give it a try, first define for yourself requirements for what errors you want to catch, what they’ll look like, and what good code looks like, and then define acceptable false positive and false negative rates and test your code on a wide variety of representative inputs, validating your code against human judgement to see if it performs well enough for your intended use. But I’m guessing you’re really looking for more than simple repetition of tokens, and if you want to have a chance at succeeding I think you need to clearly define what you’re looking for and how you’ll measure success and then validate your code. A teaching tool can do great harm if it doesn’t actually teach the correct lesson. For example if your tool simply encourages students to obfuscate their code so it doesn’t get flagged as violating DRY, or if the tool doesn’t flag bad code so the student assumes it’s ok. Or if it flags code that is actually very well written.
More specifically, what types of repetition are ok and what aren’t? Is it good or bad to use “if” or “for” or other syntax repeatedly in code? Is it ok for variables and functions/methods to have names with common substrings (e.g. average_age, average_salary, etc.?). How many times is repetition ok before abstraction should happen, and when it does what kind of abstraction is needed and at what level (e.g. a simple method, or a functor, or a whole other class, or a whole other module?). Is more abstraction always better or is perfect sometimes the enemy of on time on budget? This is a really interesting problem, but it’s also a very hard problem, and honestly I think a research problem, which is the reason for my frame challenge.
Edit:
Or if you definitely want to try this anyway, you can make it a teaching tool--not necessarily as you may have intended, but rather by showing your students your adherence to DRY in the code you write when creating your tool, and by introducing them to the nuances of DRY and the shortcomings of automated code quality assessment by being transparent with them about the limitations of your quality assessment tool. What I wouldn’t do is use it like some professors use plagiarism detection tools, as a digital oracle whose assessment of the quality of the students’ code is unquestioned. That approach is likely to cause more harm than good toward the students.
I suggest the following approach: let's say that repetitions should be at least 3 lines long. Then we hash every group of 3 consecutive lines. If a hash repeats, we write down the line number where it occurred. All that is left is to join adjacent duplicated line numbers together to get longer sequences.
For example, if you have duplicate blocks on lines 100-103 and 200-203, you will get {HASH1: (100, 200), HASH2: (101, 201)} (lines 100-102 and 200-202 produce the same value HASH1, and HASH2 covers lines 101-103 and 201-203). When you join the results, they produce the sequence (100, 101, 200, 201). Finding the consecutive runs in that sequence, you get ((100, 101), (200, 201)).
As there are no nested comparisons between lines, the time complexity is linear (both hashing and dictionary insertion are O(n)).
Algorithm:
read text line by line
transform it by
removing blanks
removing empty lines, saving mapping to original for the future
for each 3 transformed lines, join them and calculate hash on it
filter lines whose hashes occur more than once (these are repetitions of at least 3 lines)
find longest sequences and present repetitive text
Code:
from itertools import groupby, cycle
import re

def sequences(l):
    x2 = cycle(l)
    next(x2)
    grps = groupby(l, key=lambda j: j + 1 == next(x2))
    yield from (tuple(v) + (next((next(grps)[1])),) for k, v in grps if k)

with open('program.cpp') as fp:
    text = fp.readlines()

# remove white spaces
processed, text_map = [], {}
proc_ix = 0
for ix, line in enumerate(text):
    line = re.sub(r"\s+", "", line, flags=re.UNICODE)
    if line:
        processed.append(line)
        text_map[proc_ix] = ix
        proc_ix += 1

# calc hashes
hashes, hpos = [], {}
for ix in range(len(processed)-2):
    h = hash(''.join(processed[ix:ix+3]))  # join 3 lines
    hashes.append(h)
    hpos.setdefault(h, []).append(ix)  # this list will reflect lines that are duplicated

# filter duplicated three liners
seqs = []
for k, v in hpos.items():
    if len(v) > 1:
        seqs.extend(v)
seqs = sorted(list(set(seqs)))

# find longer sequences
result = {}
for seq in sequences(seqs):
    result.setdefault(hashes[seq[0]], []).append((text_map[seq[0]], text_map[seq[-1]+3]))

print('Duplicates found:')
for v in result.values():
    print('-'*20)
    vbeg, vend = v[0]
    print(''.join(text[vbeg:vend]))
    print(f'Found {len(v)} duplicates, lines')
    for line_numbers in v:
        print(f'{1+line_numbers[0]} : {line_numbers[1]}')

Recursive python matching algorithm based on subsets working too slowly

I'm building a web app to match high school students considering a gap year to students who have taken a gap year, based on interest as denoted by tags. A prototype is up at covidgapyears.com. I have never written a matching/recommendation algorithm, so though people have suggested things like collaborative filtering and association rule mining, or adapting the stable marriage problem, I don't think any of those will work because it's a small dataset (few hundred users right now, few thousand soon). So I wrote my own alg using common sense.
It essentially takes in a list of tags that the student is interested in, then searches for an exact match of those tags with someone who has taken a gap year and registered with the site (who also selected tags on registration). An exactMatch, as given below, is when the tags the user specifies are ALL contained by some profile (i.e., are a subset). If it can't find an exact match with ALL of the user's inputted tags, it will check all n-1 length subsets of the tags list itself to see if any less selective queries have matches. It does this recursively until at least 3 matches are found. While it works fine for small tag selections (up to 5-7), it gets slow for larger tag selections (7-13), taking several seconds to return a result. When 11-13 tags are selected, it hits a Heroku error due to worker timeout.
I did some tests by putting variables inside the algorithm to count computations and it seems that when it goes a bit deep into the recursive stack, it checks a few hundred subsets each time (to see if there's an exactMatch for that subset, and if there is, add it to results list to output), and the total number of computations doubles as you add one more tag (it went 54, 150, 270, 500, 1000, 1900, 3400 operations for more and more tags). It is true that there are a few hundred subsets at each depth. But exactMatches is O(1) as I've written it (no iteration), and aside from the other O(1) operations like IF, the FOR inside the subset loop will, at most, be gone through around 10 times. This agrees with the measured result of a few thousand computations each time.
This did not surprise me, as selecting and iterating over all subsets seems to be something that could get nonlinearly harder, but my question is about why it's so slow despite only doing a few thousand computations. I know my computer operates in GHz and I expect web servers are similar, so surely a few thousand computations would be near-instantaneous? What am I missing and how can I improve this algorithm? Any other approaches I should look into?
import itertools

# takes in a list of length n and returns a list of all combos of subsets of depth n
def arbSubsets(seq, n):
    return list(itertools.combinations(seq, len(seq)-n))

# takes in a tagsList and checks Gapper.objects.all to see if any gapper has all those tags
def exactMatches(tagsList):
    tagsSet = set(tagsList)
    exactMatches = []
    for gapper in Gapper.objects.all():
        gapperSet = set(gapper.tags.names())
        if tagsSet.issubset(gapperSet):
            exactMatches.append(gapper)
    return exactMatches

# takes in tagsList that has been cleaned to remove any tags that NO gappers have and then checks gapper objects to find optimal match
def matchGapper(tagsList, depth, results):
    # handles the case where we're only given tags contained by no gappers
    if depth == len(tagsList):
        return []
    # counter variable is to measure complexity for debugging
    counter += 1
    # we don't want too many results or it stops feeling tailored
    upper_limit_results = 3
    # now we must check subsets for match
    subsets = arbSubsets(tagsList, depth)
    for subset in subsets:
        counter += 1
        matches = exactMatches(subset)
        if matches:
            for match in matches:
                counter += 1
                # now need to check because we might be adding depth 2 to results from depth 1
                # which we didn't do before, to make sure we have at least 3 results
                if match not in results:
                    # don't want to show too many or it doesn't feel tailored anymore
                    counter += 1
                    if len(results) > upper_limit_results: break
                    results.append(match)
    # always give at least 3 results
    if len(results) > 2:
        return results
    else:
        # check one level deeper (less specific) into tags if not enough gappers that match to get more results
        counter += 1
        return matchGapper(tagsList, depth + 1, results)

# this is the list of matches we then return to the user
matches = matchGapper(tagsList, 0, [])
It doesn't seem that you are doing only a few hundred computation steps. In fact you have a few hundred options at each depth, so you should not add, but multiply, the number of steps at each depth to estimate the complexity of your solution.
Additionally, the statement "or adapting the stable marriage problem, I don't think any of those will work because it's a small dataset" is also not true. Although these algorithms may be overkill for some very simple cases, they are still valid and will work for them.
Okay, so after much fiddling with timers I've figured it out. There are a few functions at play when matching: exactMatches, matchGapper and arbSubsets. When I put the counter into a global variable and measured operations (measured as lines of my code being executed), it came in at around 2-10K for large inputs (around 10 tags).
It is true that arbSubsets, which returns a list of subsets, at first seems like a plausible bottleneck. But if you look closely, we are 1) handling small numbers of tags (order of 10-50) and, more importantly, 2) only calling arbSubsets when we recurse matchGapper, which only happens a maximum of about 10 times, since tagsList can only be around 10 long (order of 10-50, as above). And when I checked the time it took to generate arbSubsets, it was on the order of 2e-5 seconds. So the total time spent generating the subsets of arbitrary size is only about 2e-4 seconds. In other words, not the source of the 5-30 second waiting time in the web app.
And so with that aside, knowing that arbSubsets is only called on the order of 10 times, and is fast at that, and knowing that there are only around a maximum of 10K computations taking place in my code, it starts to become clear that I must be using some out-of-the-box function (like set() or .issubset() or something like that) that takes a nontrivial amount of time to compute and is executed many times. Adding some counters in some more places, it becomes clear that exactMatches() accounts for around 95-99% of all computations that take place (as would be expected if we have to check all combinations of subsets of various sizes for exact matches).
So the problem, at this point, is reduced to the fact that exactMatches takes around 0.02s (empirically) as implemented and is called several thousand times. And so we can either try to make it faster by a couple of orders of magnitude (it's already pretty optimal), or take another approach that doesn't involve finding matches using subsets. A friend of mine suggested creating a dict with all the combinations of tags (so 2^len(tagsList) keys) and setting them equal to lists of registered profiles with that exact combination. This way, querying is just traversing a (huge) dict, which can be done fast. Any other suggestions are welcome.
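One hedged way to realize that suggestion, assuming each gapper has a modest number of tags (the index stores every subset of each gapper's own tag set, so it grows as 2^k per profile; the model and field names follow the question's code, the helper name is made up):

from collections import defaultdict
from itertools import combinations

# Build the index once: every tag combination a gapper covers maps to that gapper.
tag_index = defaultdict(list)
for gapper in Gapper.objects.all():
    tags = sorted(set(gapper.tags.names()))
    for r in range(1, len(tags) + 1):
        for combo in combinations(tags, r):
            tag_index[combo].append(gapper)

def exact_matches_fast(tagsList):
    # A query is now a single dict lookup instead of a scan over all gappers.
    return tag_index.get(tuple(sorted(set(tagsList))), [])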

Working on Google Colab with Python: garbage collector is not working?

I am working on Google Colab using Python, and I have 12GB of RAM.
I am trying to use word2vec vectors pre-trained by Google to represent sentences as vectors.
The vectors should all have the same length even if the sentences do not have the same number of words, so I used padding (the maximum length of a sentence here is my variable max).
The problem is that every time I try to create a matrix containing all of my vectors, I quickly run out of RAM (at around the 20,000th of 128k vectors).
This is my code:
import gc
import numpy as np

def buildWordVector(new_X, sent, model, l):
    for x in range(len(sent)):
        try:
            l[x] = list(model[sent[x]])
            gc.collect()  # doesn't do anything except slow the run time
        except KeyError:
            continue
    new_X.append([list(x) for x in l])

final_x_train = []
l = np.zeros((max, 300))  # the length of a Google pretrained vector is 300
for i in new_X_train:
    buildWordVector(final_x_train, i, model, l)
    gc.collect()  # doesn't do anything except slow the run time
All the variables that I have:
df: 16.8MiB
new_X_train: 1019.1KiB
X_train: 975.5KiB
y_train: 975.5KiB
new_X_test: 247.7KiB
X_test: 243.9KiB
y_test: 243.9KiB
l: 124.3KiB
final_x_train: 76.0KiB
stop_words: 8.2KiB
But I am at 12GB/12GB of RAM and the session has expired.
As you can see, the garbage collector is not doing anything because apparently it cannot see the variables, but I really need a solution to this problem. Can anyone help me please?
In general in a garbage-collected language like Python you don't need to explicitly request garbage-collection: it happens automatically when you've stopped retaining references (variables/transitive-property-references) to objects.
So, if you're getting a memory error here, it's almost certainly because you're really trying to use more than the available amount of memory at a time.
Your code is a bit incomplete and unclear – what is max? what is new_X_train? where are you getting those memory sizing estimates? etc.
But notably: it's not typical to represent a sentence as a concatenation of each word's vector. (So that, with 300d word-vectors, and an up-to-10-word sentence, you have a 3000d sentence-vector.) It's far more common to average the word-vectors together, so both words and sentences have the same size, and there's no blank padding at the end of short sentences.
(That's still a very crude way to create text-vectors, but more common than padding-to-maximum-sentence-size.)
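A minimal sketch of that averaging approach, assuming model is the loaded pretrained KeyedVectors and new_X_train is a list of token lists as in the question (the helper name is made up):

import numpy as np

def sentence_vector(tokens, model, size=300):
    # Average the vectors of in-vocabulary words; zeros if none are known.
    vecs = []
    for w in tokens:
        try:
            vecs.append(model[w])
        except KeyError:
            continue
    if not vecs:
        return np.zeros(size)
    return np.mean(vecs, axis=0)

final_x_train = np.array([sentence_vector(sent, model) for sent in new_X_train])

This yields one 300-dimensional row per sentence instead of max rows per sentence, shrinking memory use by roughly a factor of max.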

Detect bigram collection on a very large text dataset

I would like to find bigrams in a large corpus in text format. As the corpus cannot be loaded into memory at once and its lines are very long, I load it in chunks of 1 KB each:
def read_in_chunks(filename, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = filename.read(chunk_size)
        if not data:
            break
        yield data
Then I want to go through the corpus piece by piece and find bigrams, using the gensim Phrases() and Phraser() classes, but while training, my model constantly loses state. Thus, I tried to save and reload the model after each megabyte that I read and then free the memory, but it still loses state. My code is here:
from gensim.models.phrases import Phrases, Phraser

with open("./final/corpus.txt", "r", encoding='utf8') as r:
    max_vocab_size = 20000000
    phrases = Phrases(max_vocab_size=max_vocab_size)
    i = 1
    j = 1024
    sentences = ""
    for piece in read_in_chunks(r):
        if i <= j:
            sentences = sentences + piece
        else:
            phrases.add_vocab(sentences)
            phrases = Phrases(sentences)
            phrases = phrases.save('./final/phrases.txt')
            phrases = Phraser.load('./final/phrases.txt')
            sentences = ""
            j += 1024
        i += 1
print("Done")
Any suggestion?
Thank you.
When you do the two lines...
phrases.add_vocab(sentences)
phrases = Phrases(sentences)
...that 2nd line is throwing away any existing instance inside the phrases variable, and replacing it with a brand new instance (Phrases(sentences)). There's no chance for additive adjustment to the single instance.
Secondarily, there's no way two consecutive lines of .save()-then-immediate-re-.load() can save net memory usage. At the very best, the .load() would be unnecessary, only exactly reproducing what was just .save()d, but wasting a lot of time and temporary memory loading a second copy, then discarding the duplicate that was already in phrases to assign phrases to the new clone.
While these are problems, more generally, the issue is that what you need done doesn't have to be this complicated.
The Phrases class will accept as its corpus of sentences an iterable object where each item is a list-of-string-tokens. You don't have to worry about chunk-sizes, and calling add_vocab() multiple times – you can just provide a single object that itself offers up each item in turn, and Phrases will do the right thing. You do have to worry about breaking up raw lines into the specific words you want considered ('tokenization').
(For a large corpus, you might still run into memory issues related to the number of unique words that Phrases is trying to count. But it won't matter how arbitrarily large the number of items is – because it will only look at one at a time. Only the accumulation of unique words will consume running memory.)
For a good intro to how an iterable object can work in such situations, a good blog post is:
Data streaming in Python: generators, iterators, iterables
If your corpus.txt file is already set up to be one reasonably-sized sentence per line, and all words are already delimited by simple spaces, then an iterable class might be as simple as:
class FileByLineIterable(object):
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        with open(self.filename, 'r', encoding='utf8') as src:
            for line in src:  # stream one line at a time; no need to read the whole file
                yield line.split()
Then, your code might just be as simple as...
sentences = FileByLineIterable('./final/corpus.txt')
phrases = Phrases(sentences, max_vocab_size=max_vocab_size)
...because the Phrases class is getting what it wants – a corpus that offers via iteration just one list-of-words item at a time.
Note:
you may want to enable logging at the INFO level to monitor progress and watch for any hints of things going wrong (a minimal setup is sketched after these notes)
there's a slightly more advanced line-by-line iterator, which also limits any one line's text to no more than 10,000 tokens (to match an internal implementation limit of gensim Word2Vec), and opens files from places other than local file-paths, available at gensim.models.word2vec.LineSentence. See:
https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence
(Despite this being packaged in the word2vec package, it can be used elsewhere.)
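For example, a minimal sketch combining INFO logging with LineSentence as the drop-in iterable (paths as in the question; the save filename is illustrative):

import logging
from gensim.models.word2vec import LineSentence
from gensim.models.phrases import Phrases

logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s",
                    level=logging.INFO)

sentences = LineSentence('./final/corpus.txt')
phrases = Phrases(sentences, max_vocab_size=20000000)
phrases.save('./final/phrases_model')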
