Hashtables over large natural language word sets - python

I'm writing a program in Python to do a unigram (and eventually bigram, etc.) analysis of movie reviews. The goal is to create feature vectors to feed into libsvm. I have 50,000-odd unique words in my feature vector (which seems rather large to me, but I am relatively sure I'm right about that).
I'm using the Python dictionary implementation as a hashtable to keep track of new words as I meet them, but I'm noticing an enormous slowdown after the first 1,000-odd documents are processed. Would I have better efficiency (given the distribution of natural language) if I used several smaller hashtables/dictionaries, or would it be the same/worse?
More info:
The data is split into 1500 or so documents, 500-ish words each. There are between 100 and 300 unique words (with respect to all previous documents) in each document.
My current code:
#processes each individual file, tok == filename, v == predefined class
def processtok(tok, v):
    #n is the number of unique words so far,
    #reference is the mapping reference in case I want to add new data later
    #hash is the hashtable
    #statlist is the massive feature vector I'm trying to build
    global n
    global reference
    global hash
    global statlist
    cin = open(tok, 'r')
    statlist = [0]*43990
    statlist[0] = v
    lines = cin.readlines()
    for l in lines:
        line = l.split(" ")
        for word in line:
            if word in hash.keys():
                if statlist[hash[word]] == 0:
                    statlist[hash[word]] = 1
                else:
                    hash[word] = n
                    n += 1
                    ref.write('['+str(word)+','+str(n)+']'+'\n')
                    statlist[hash[word]] = 1
    cin.close()
    return statlist
Also keep in mind that my input data is about 6 MB and my output data is about 300 MB. I'm simply startled at how long this takes, and I feel that it shouldn't be slowing down so dramatically as it's running.
Slowing down: the first 50 documents take about 5 seconds, the last 50 take about 5 minutes.

@ThatGuy has made the fix, but hasn't actually told you this:
The major cause of your slowdown is the line
if word in hash.keys():
which laboriously makes a list of all the keys so far, then laboriously searches that list for `word`. The time taken is proportional to the number of keys, i.e. the number of unique words found so far. That's why it starts fast and becomes slower and slower.
All you need is if word in hash: which in 99.9999999% of cases takes time independent of the number of keys -- one of the major reasons for having a dict.
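A quick way to see the difference (a sketch; the exact numbers will vary, and note that the gap only exists in Python 2, since in Python 3 dict.keys() returns a view whose membership test is also O(1)):
import timeit

setup = "d = dict.fromkeys(range(50000))"
print(timeit.timeit("49999 in d", setup=setup, number=1000))         # hash lookup: constant time per test
print(timeit.timeit("49999 in d.keys()", setup=setup, number=1000))  # Python 2: builds a 50,000-item list per test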
The faffing about with statlist[hash[word]] doesn't help, either. By the way, the fixed size in statlist=[0]*43990 needs explanation.
More problems
Problem A: Either (1) your code suffered from indentation distortion when you published it, or (2) hash will never be updated by that function. Quite simply, if word is not in hash, i.e. it's the first time you've seen it, absolutely nothing happens. The hash[word] = n statement (the ONLY code that updates hash) is NOT executed. So no word will ever be in hash.
It looks like this block of code needs to be shifted left 4 columns, so that it's aligned with the outer if:
else:
    hash[word]=n
    ref.write('['+str(word)+','+str(n)+']'+'\n')
    statlist[hash[word]] = 1
Problem B: There is no code at all to update n (allegedly the number of unique words so far).
I strongly suggest that you take as many of the suggestions that @ThatGuy and I have made as you care to, rip out all the global stuff, fix up your code, chuck in a few print statements at salient points, and run it over, say, 2 documents of 3 lines each with about 4 words per line. Ensure that it is working properly. THEN run it on your big data set (with the prints suppressed). In any case you may want to put out stats (like number of documents, lines, words, unique words, elapsed time, etc.) at regular intervals.
Another problem
Problem C: I mentioned this in a comment on @ThatGuy's answer, and he agreed with me, but you haven't mentioned taking it up:
>>> line = "foo bar foo\n"
>>> line.split(" ")
['foo', 'bar', 'foo\n']
>>> line.split()
['foo', 'bar', 'foo']
>>>
Your use of .split(" ") will lead to spurious "words" and distort your statistics, including the number of unique words that you have. You may well find the need to change that hard-coded magic number.
I say again: there is no code that updates n in the function. Doing hash[word] = n seems very strange, even if n is updated for each document.

I don't think Python's dictionary has anything to do with your slowdown here, especially when you say the entries are around 100. I am hoping that you are referring to insertion and retrieval, which are both O(1) in a dictionary. The problem could be that you are not using iterators (or loading key, value pairs one at a time) when creating the dictionary, and are instead loading all of the words into memory. In that case, the slowdown is due to memory consumption.

I think you've got a few problems going on here. Mostly, I am unsure of what you are trying to accomplish with statlist. It seems to me like it is serving as a poor duplicate of your dictionary. Create it after you have found all of your words.
Here is my guess as to what you want:
def processtok(tok, v):
    global n
    global reference
    global hash
    cin = open(tok, 'rb')
    for l in cin:
        line = l.split(" ")
        for word in line:
            if word in hash:
                hash[word] += 1
            else:
                hash[word] = 1
                n += 1
                ref.write('['+str(word)+','+str(n)+']'+'\n')
    cin.close()
    return hash
Note that this means you no longer need an "n", as you can discover it with len(hash).
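Once that dict is complete, a second pass over the documents can build each feature vector; a rough sketch, assuming a vocab dict that maps word -> column index (as in the original code) rather than word -> count:
def build_statlist(tok, v, vocab):
    # vocab maps each word to its column index; column 0 holds the class label
    statlist = [0] * (len(vocab) + 1)
    statlist[0] = v
    with open(tok) as cin:
        for line in cin:
            for word in line.split():  # plain .split() avoids the trailing-newline issue raised above
                if word in vocab:
                    statlist[vocab[word]] = 1
    return statlist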


How to detect the amount of almost-repetition in a text file?

I am a programming teacher, and I would like to write a script that detects the amount of repetition in a C/C++/Python file. I guess I can treat any file as pure text.
The script's output would be the number of similar sequences that repeat. Eventually, I am only interested in a DRY metric (how much the code satisfies the DRY principle).
Naively I tried to do a simple autocorrelation but it would be hard to find the proper threshold.
import numpy as np
import matplotlib.pyplot as plt

u = open("find.c").read()
v = [ord(x) for x in u]
y = np.correlate(v, v, mode="same")
y = y[: int(len(y) / 2)]
x = range(len(y))
z = np.polyval(np.polyfit(x, y, 3), x)
f = (y - z)[:-5]
plt.plot(f)
plt.show()
So I am looking at different strategies... I also tried to compare the similarities between each line, each group of 2 lines, each group of 3 lines ...
import difflib
import numpy as np

lines = open("b.txt").readlines()
lines = [line.strip() for line in lines]
n = 3
d = []
for i in range(len(lines)):
    a = lines[i:i+n]
    for j in range(len(lines)):
        b = lines[j:j+n]
        if i == j: continue  # skip same line
        group_size = np.sum([len(x) for x in a])
        if group_size < 5: continue  # skip short lines
        ratio = 0
        for u, v in zip(a, b):
            r = difflib.SequenceMatcher(None, u, v).ratio()
            ratio += r if r > 0.7 else 0
        d.append(ratio)
dry = sum(d) / len(lines)
In the following, we can identify some repetition at a glance:
w = int(len(d) / 100)
e = np.convolve(d, np.ones(w), "valid") / w * 10
plt.plot(range(len(d)), d, range(len(e)), e)
plt.show()
Why not use:
d = np.exp(np.array(d))
Thus, the difflib module looks promising: SequenceMatcher does some magic (Levenshtein?), but I would need some magic constants as well (0.7)... However, this code is worse than O(n^2) and runs very slowly for long files.
What is funny is that the amount of repetition is quite easily identified with attentive eyes (sorry to this student for having taken his code as a good bad example).
I am sure there is a more clever solution out there.
Any hint?
I would build a system based on compressibility, because that is essentially what things being repeated means. Modern compression algorithms are already looking for how to reduce repetition, so let's piggy back on that work.
Things that are similar will compress well under any reasonable compression algorithm, e.g. LZ. Under the hood, the compressed output is essentially the text plus references back into itself, which you might be able to pull out.
Write a program that feeds lines [0:n] into the compression algorithm and compares the resulting output length with that of [0:n+1].
When you see the incremental length of the compressed output increases by a lot less than the incremental input, you note down that you potentially have a DRY candidate at that location, plus if you can figure out the format, you can see what previous text it was deemed similar to.
If you can figure out the compression format, you don't need to rely on the "size doesn't grow as much" heuristic, you can just pull out the references directly.
If needed, you can find similar structures with different names by pre-processing the input, for instance by normalizing the names. However I foresee this getting a bit messy, so it's a v2 feature. Pre-processing can also be used to normalize the formatting.
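A minimal sketch of that incremental-compression heuristic using zlib (the per-line threshold of 0.3 and the minimum length of 10 characters are arbitrary assumptions, and recompressing the whole prefix for every line makes this O(n^2) overall):
import zlib

def compression_deltas(lines, level=9):
    # bytes each new line adds to the compressed length of the prefix so far
    deltas, buf, prev = [], b"", len(zlib.compress(b"", level))
    for line in lines:
        buf += line.encode("utf-8")
        cur = len(zlib.compress(buf, level))
        deltas.append(cur - prev)
        prev = cur
    return deltas

lines = open("find.c").readlines()
deltas = compression_deltas(lines)
# lines that add far fewer compressed bytes than their raw length are repetition candidates
suspects = [i for i, (line, delta) in enumerate(zip(lines, deltas))
            if len(line) > 10 and delta < 0.3 * len(line)]
print(suspects)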
Looks like you're choosing a long path. I wouldn't go there.
I would look into trying to minify the code before analyzing it. To completely remove any influence of variable names, extra spacing, formatting and even slight logic reshuffling.
Another approach would be comparing byte-code of the students. But it may be not a very good idea since the result will likely have to be additionally cleaned up.
dis would be an interesting option.
I would most likely settle on comparing their ASTs. But an AST is likely to give false positives for short functions, because their structure may be too similar, so consider checking short functions with something else, something trivial.
On top of that, I would consider using Levenshtein distance or something similar to numerically calculate the differences between the byte-codes/sources/ASTs/dis output of the students. This would be what, almost O(N^2)? Shouldn't matter.
Or, if needed, make it more complex and calculate the distance between each function of student A and each function of student B, highlighting cases when the distance is too short. It may be not needed though.
With simplification and normalization of the input, more algorithms should start returning good results. If a student is good enough to take someone's code and reshuffle not only the variables, but the logic and maybe even improve the algo, then this student understands the code well enough to defend it and use it with no help in future. I guess, that's the kind of help a teacher would want to be exchanged between students.
You can treat this as a variant of the longest common subsequence problem between the input and itself, where the trivial matching of each element with itself is disallowed. This retains the optimal substructure of the standard algorithm, since it can be phrased as a non-transitive “equality” and the algorithm never relies on transitivity.
As such, we can write this trivial implementation:
import operator

class Repeat:
    def __init__(self, l):
        self.l = list(l)
        self.memo = {}
    def __call__(self, m, n):
        l = self.l
        memo = self.memo
        k = m, n
        ret = memo.get(k)
        if not ret:
            if not m or not n:
                ret = 0, None
            elif m != n and l[m-1] == l[n-1]:  # critical change here!
                z, tail = self(m-1, n-1)
                ret = z+1, ((m-1, n-1), tail)
            else:
                ret = max(self(m-1, n), self(m, n-1), key=operator.itemgetter(0))
            memo[k] = ret
        return ret
    def go(self):
        n = len(self.l)
        v = self(n, n)[1]
        ret = []
        while v:
            x, v = v
            ret.append(x)
        ret.reverse()
        return ret

def repeat(l):
    return Repeat(l).go()
You might want to canonicalize lines of code by removing whitespace (except perhaps between letters), removing comments, and/or replacing each unique identifier with a standardized label. You might also want to omit trivial lines like } in C/C++ to reduce noise. Finally, the symmetry means that only cases with, say, m>=n need be treated.
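A rough canonicalization pass along those lines (the regexes are simplistic assumptions: they only strip //- and #-style comments, and they relabel keywords along with identifiers, which may or may not matter here):
import re

def canonicalize(line):
    line = re.sub(r"//.*|#.*", "", line)            # crudely drop C++/Python style comments
    line = re.sub(r"\b[A-Za-z_]\w*\b", "ID", line)  # one label for every identifier (and keyword)
    return re.sub(r"\s+", "", line)                 # drop all whitespace

code = [canonicalize(l) for l in open("b.txt")]
code = [l for l in code if l not in ("", "}")]      # omit blank and trivial lines
print(repeat(code))  # pairs of matched (i, j) positions in the filtered list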
Of course, there are also "real" answers and real research on this issue!
Frame challenge: I’m not sure you should do this
It’d be a fun programming challenge for yourself, but if you intend to use it as a teaching tool, I’m not sure I would. There’s not a good definition of “repeat” from the DRY principle that would be easy to test for fully in a computer program. The human definition, which I’d say is basically “failure to properly abstract your code at an appropriate level, manifested via some type of repetition of code, whether repeating exact blocks, repeating the same idea over and over again, or somewhere in between”, isn’t something I think anyone will be able to get working well enough at this time to use as a tool that teaches good habits with respect to DRY without confusing the student or teaching bad habits too. For now I’d argue this is a job for humans, because it’s easy for us and hard for computers.
That said if you want to give it a try, first define for yourself requirements for what errors you want to catch, what they’ll look like, and what good code looks like, and then define acceptable false positive and false negative rates and test your code on a wide variety of representative inputs, validating your code against human judgement to see if it performs well enough for your intended use. But I’m guessing you’re really looking for more than simple repetition of tokens, and if you want to have a chance at succeeding I think you need to clearly define what you’re looking for and how you’ll measure success and then validate your code. A teaching tool can do great harm if it doesn’t actually teach the correct lesson. For example if your tool simply encourages students to obfuscate their code so it doesn’t get flagged as violating DRY, or if the tool doesn’t flag bad code so the student assumes it’s ok. Or if it flags code that is actually very well written.
More specifically, what types of repetition are ok and what aren’t? Is it good or bad to use “if” or “for” or other syntax repeatedly in code? Is it ok for variables and functions/methods to have names with common substrings (e.g. average_age, average_salary, etc.?). How many times is repetition ok before abstraction should happen, and when it does what kind of abstraction is needed and at what level (e.g. a simple method, or a functor, or a whole other class, or a whole other module?). Is more abstraction always better or is perfect sometimes the enemy of on time on budget? This is a really interesting problem, but it’s also a very hard problem, and honestly I think a research problem, which is the reason for my frame challenge.
Edit:
Or if you definitely want to try this anyway, you can make it a teaching tool--not necessarily as you may have intended, but rather by showing your students your adherence to DRY in the code you write when creating your tool, and by introducing them to the nuances of DRY and the shortcomings of automated code quality assessment by being transparent with them about the limitations of your quality assessment tool. What I wouldn’t do is use it like some professors use plagiarism detection tools, as a digital oracle whose assessment of the quality of the students’ code is unquestioned. That approach is likely to cause more harm than good toward the students.
I suggest the following approach: let's say that repetitions should be at least 3 lines long. Then we hash every 3 lines. If a hash repeats, we write down the line number where it occurred. All that is left is to join adjacent duplicated line numbers together to get longer sequences.
For example, if you have duplicate blocks on lines 100-103 and 200-203, you will get {HASH1: (100, 200), HASH2: (101, 201)} (lines 100-102 and 200-202 produce the same value HASH1, and HASH2 covers lines 101-103 and 201-203). When you join the results, it will produce the sequence (100, 101, 200, 201). Finding the monotonic subsequences in that, you will get ((100, 101), (200, 201)).
As no nested loops are used, the time complexity is linear (hashing and dictionary insertion are O(1) each, so the whole pass is O(n)).
Algorithm:
read text line by line
transform it by
removing blanks
removing empty lines, saving mapping to original for the future
for each 3 transformed lines, join them and calculate hash on it
filter lines whose hashes occur more than once (these are repetitions of at least 3 lines)
find longest sequences and present repetitive text
Code:
from itertools import groupby, cycle
import re

def sequences(l):
    x2 = cycle(l)
    next(x2)
    grps = groupby(l, key=lambda j: j + 1 == next(x2))
    yield from (tuple(v) + (next((next(grps)[1])),) for k, v in grps if k)

with open('program.cpp') as fp:
    text = fp.readlines()

# remove white spaces
processed, text_map = [], {}
proc_ix = 0
for ix, line in enumerate(text):
    line = re.sub(r"\s+", "", line, flags=re.UNICODE)
    if line:
        processed.append(line)
        text_map[proc_ix] = ix
        proc_ix += 1

# calc hashes
hashes, hpos = [], {}
for ix in range(len(processed)-2):
    h = hash(''.join(processed[ix:ix+3]))  # join 3 lines
    hashes.append(h)
    hpos.setdefault(h, []).append(ix)  # this list will reflect lines that are duplicated

# filter duplicated three liners
seqs = []
for k, v in hpos.items():
    if len(v) > 1:
        seqs.extend(v)
seqs = sorted(list(set(seqs)))

# find longer sequences
result = {}
for seq in sequences(seqs):
    result.setdefault(hashes[seq[0]], []).append((text_map[seq[0]], text_map[seq[-1]+3]))

print('Duplicates found:')
for v in result.values():
    print('-'*20)
    vbeg, vend = v[0]
    print(''.join(text[vbeg:vend]))
    print(f'Found {len(v)} duplicates, lines')
    for line_numbers in v:
        print(f'{1+line_numbers[0]} : {line_numbers[1]}')

Recursive python matching algorithm based on subsets working too slowly

I'm building a web app to match high school students considering a gap year to students who have taken a gap year, based on interest as denoted by tags. A prototype is up at covidgapyears.com. I have never written a matching/recommendation algorithm, so though people have suggested things like collaborative filtering and association rule mining, or adapting the stable marriage problem, I don't think any of those will work because it's a small dataset (few hundred users right now, few thousand soon). So I wrote my own alg using common sense.
It essentially takes in a list of tags that the student is interested in, then searches for an exact match of those tags with someone who has taken a gap year and registered with the site (who also selected tags on registration). An exactMatch, as given below, is when the tags the user specifies are ALL contained by some profile (i.e., are a subset). If it can't find an exact match with ALL of the user's inputted tags, it will check all n-1 length subsets of the tags list itself to see if any less selective queries have matches. It does this recursively until at least 3 matches are found. While it works fine for small tag selections (up to 5-7), it gets slow for larger tag selections (7-13), taking several seconds to return a result. When 11-13 tags are selected, it hits a Heroku error due to worker timeout.
I did some tests by putting variables inside the algorithm to count computations and it seems that when it goes a bit deep into the recursive stack, it checks a few hundred subsets each time (to see if there's an exactMatch for that subset, and if there is, add it to results list to output), and the total number of computations doubles as you add one more tag (it went 54, 150, 270, 500, 1000, 1900, 3400 operations for more and more tags). It is true that there are a few hundred subsets at each depth. But exactMatches is O(1) as I've written it (no iteration), and aside from the other O(1) operations like IF, the FOR inside the subset loop will, at most, be gone through around 10 times. This agrees with the measured result of a few thousand computations each time.
This did not surprise me, as selecting and iterating over all subsets seems like something that could get hard non-linearly, but my question is about why it's so slow despite only doing a few thousand computations. I know my computer operates in GHz and I expect web servers are similar, so surely a few thousand computations would be near-instantaneous? What am I missing and how can I improve this algorithm? Any other approaches I should look into?
import itertools

# takes in a list of length n and returns a list of all combos of subsets of depth n
def arbSubsets(seq, n):
    return list(itertools.combinations(seq, len(seq)-n))

# takes in a tagsList and checks Gapper.objects.all to see if any gapper has all those tags
def exactMatches(tagsList):
    tagsSet = set(tagsList)
    exactMatches = []
    for gapper in Gapper.objects.all():
        gapperSet = set(gapper.tags.names())
        if tagsSet.issubset(gapperSet):
            exactMatches.append(gapper)
    return exactMatches

# takes in tagsList that has been cleaned to remove any tags that NO gappers have and then checks gapper objects to find optimal match
def matchGapper(tagsList, depth, results):
    # handles the case where we're only given tags contained by no gappers
    if depth == len(tagsList):
        return []
    # counter variable is to measure complexity for debugging
    counter += 1
    # we don't want too many results or it stops feeling tailored
    upper_limit_results = 3
    # now we must check subsets for match
    subsets = arbSubsets(tagsList, depth)
    for subset in subsets:
        counter += 1
        matches = exactMatches(subset)
        if matches:
            for match in matches:
                counter += 1
                # now need to check because we might be adding depth 2 results to depth 1 results,
                # which we didn't do before, to make sure we have at least 3 results
                if match not in results:
                    # don't want to show too many or it doesn't feel tailored anymore
                    counter += 1
                    if len(results) > upper_limit_results: break
                    results.append(match)
    # always give at least 3 results
    if len(results) > 2:
        return results
    else:
        # check one level deeper (less specific) into tags if not enough gappers match, to get more results
        counter += 1
        return matchGapper(tagsList, depth + 1, results)

# this is the list of matches we then return to the user
matches = matchGapper(tagsList, 0, [])
It doesn't seem that you are doing only a few hundred computation steps. In fact you have a few hundred options at each depth, thus you should not add, but multiply the number of steps at each depth to estimate the complexity of your solution.
Additionally, the statement "or adapting the stable marriage problem, I don't think any of those will work because it's a small dataset" is also obviously not true. Although these algorithms may be overkill for some very simple cases, they are still valid and will work for them.
Okay, so after much fiddling with timers I've figured it out. There are a few functions at play when matching: exactMatches, matchGapper and arbSubset. When I put the counter into a global variable and measured operations (measured as lines of my code being executed), it came in around 2-10K for large inputs (around 10 tags).
It is true that arbSubset, which returns a list of subsets, at first seems like a plausible bottleneck. But if you look closely, we are 1) handling small numbers of tags (order of 10-50) and, more importantly, 2) only calling arbSubset when we recurse matchGapper, which only happens a max of about 10 times, since tagsList can only be around 10 long (order of 10-50, as above). And when I checked the time it took to generate arbSubsets, it was on the order of 2e-5 seconds. So the total time spent generating the subsets of arbitrary size is only about 2e-4 seconds. In other words, not the source of the 5-30 second waiting time in the web app.
And so with that aside, knowing that arbSubset is only called on the order of 10 times, and is fast at that, and knowing that there are only around a max of 10K computations taking place in my code it starts to become clear that I must be using some out-of-the-box function, I don't know--like set() or .issubset() or something like that--that takes a nontrivial amount of time to compute, and is executed many times. Adding some counters in some more places, it becomes clear that exactMatch() accounts for around 95-99% of all computations that take place (as would be expected if we have to check all combinations of subsets of various sizes for exactMatches).
So the problem, at this point, is reduced to the fact that exactMatch takes around 0.02 s (empirically) as implemented, and is called several thousand times. And so we can either try to make it faster by a couple of orders of magnitude (it's already pretty optimal), or take another approach that doesn't involve finding matches using subsets. A friend of mine suggested creating a dict with all the combinations of tags (so 2^len(tagsList) keys) and setting them equal to lists of registered profiles with that exact combination. This way, querying is just traversing a (huge) dict, which can be done fast. Any other suggestions are welcome.
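A sketch of that precomputed-dict idea, assuming the Gapper.objects.all() / gapper.tags.names() interface from the question and reasonably small tag sets per profile (the index stores up to 2^len(tags) keys per gapper):
from itertools import combinations
from collections import defaultdict

def build_subset_index(gappers):
    # every combination of a profile's tags maps back to that profile, so any query
    # whose tags are a subset of the profile's tags is answered by one dict lookup
    index = defaultdict(list)
    for gapper in gappers:
        tags = sorted(set(gapper.tags.names()))
        for r in range(1, len(tags) + 1):
            for combo in combinations(tags, r):
                index[combo].append(gapper)
    return index

def exact_matches(index, tagsList):
    return index.get(tuple(sorted(set(tagsList))), [])

# build once (at startup or whenever profiles change), query many times:
# index = build_subset_index(Gapper.objects.all())
# matches = exact_matches(index, tagsList)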

python script for removing duplicates taking 24 hrs+ to loop through 10^7 records

input t1
P95P,71655,LINC-JP,pathogenic
P95P,71655,LINC-JP,pathogenic
P71P,71655,LINC-JP,pathogenic
P71P,71655,LINC-JP,pathogenic
output op
P95P,71655,LINC-JP,pathogenic
P71P,71655,LINC-JP,pathogenic
myCode
def dup():
    fi = open("op", "a")
    l = []; final = ""
    q = []; dic = {}
    for i in open("t1"):
        k = i.split(",")
        q.append(k[1])
        q.append(k[0])
        if q in l:
            pass
        else:
            final = final + i.strip() + "\n"
            fi.write(str(i.strip()))
            fi.write("\n")
            l.append(q)
        q = []
        #print i.strip()
    fi.close()
    return final.strip()

d = dup()
In the above input, lines 1-2 and lines 3-4 are duplicates, so in the output these duplicates are removed. The entries in my input file number around 10^7.
Why has my code been running for the past 24+ hours on an input file of 76 MB? It has yet to complete even one pass of the entire input file. It works fine for small files.
Can anyone please point out the reason for this long runtime? How can I optimize my program? Thanks.
You're using an O(n^2) algorithm, which scales poorly for larger files:
for i in open("t1"):  # linear pass of the file takes O(n) time
    ...
    if q in l:  # linear scan of the list l takes O(n) time
        ...
    ...
You should consider using a set (i.e. make l a set) or itertools.groupby if duplicates will always be next to each other. These approaches will be O(n).
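For the adjacent-duplicates case (as in the sample input), a groupby version might look like this, keying on the first two comma-separated fields:
from itertools import groupby

def dedupe_adjacent(infile, outfile):
    # like uniq, this only removes duplicates that sit next to each other
    with open(infile) as fin, open(outfile, "w") as fout:
        for key, group in groupby(fin, key=lambda line: line.split(",")[:2]):
            fout.write(next(group))  # keep the first line of each run

dedupe_adjacent("t1", "op")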
If you have access to a Unix system, uniq is a nice utility that is made for your problem.
uniq input.txt output.txt
see https://www.cs.duke.edu/csl/docs/unix_course/intro-89.html
I know this is a Python question, but sometimes Python is not the tool for the task.
And you can always embed a system call in your python script.
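For example (keeping in mind that uniq compares whole lines and only collapses adjacent duplicates; sort -u would handle the general case):
import subprocess
subprocess.check_call(["uniq", "t1", "op"])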
It's not clear why you're building a huge string (final) that holds the same thing the file does, or what dic is for. In terms of performance, you can look up x in y much faster if y is a set than if y is a list. Also, a minor point: shorter variable names don't improve performance, so use good ones instead. I would suggest:
def deduplicate(infile, outfile):
    seen = set()
    #final = []
    with open(outfile, "a") as out, open(infile) as in_:
        for line in in_:
            check = tuple(line.split(",")[:2])
            if check not in seen:
                #final.append(line.strip())
                out.write(line)  # why 'strip' the '\n' then 'write' a new one?
                seen.add(check)
    #return "\n".join(final)
If you do really need final, make it a list until the last moment (see commented-out lines) - gradual string concatenation means the creation of lots of unnecessary objects.
There are a couple things that you are doing very inefficiently. The largest is that you made l a list, so the line if q in l has to search through everything in the list already in order to check if q matches it. If you make l a set, the membership check can be done using a hash calculation and array lookup, which take the same (small) amount of time no matter how much you add to the set (though it will cause l not to be read in the order that it was written).
Other little speedups that you can do include:
Using a tuple (k[1], k[0]) instead of a list for q.
You are writing your output file fi every loop. Your OS will try to batch and background the writes, but it may be faster to just do one big write at the end. I am not sure on this point but try it.

fast update in mongo db

Here's my problem. I want to make a collection in MongoDB where I have a word and the number of times it occurs. I'm doing it in Python and it's extremely slow. It's most probably because for every word I have, I check if it is already in the database (using find_one) and, if yes, get its frequency, increment it and store it back (using update). Of course, when the word is not there, I append it to a list and do a bulk insert periodically.
Is there a better way of doing this? The number of words is huge (different languages possible). Is MongoDB the right thing to use in the first place? I chose MongoDB because it was very easy to install and I picked up the tutorial in 10 minutes...
Edit: added the code as well. When I say large, I mean a file that is some 4 GB in size, full of words...
insertlist = []

def copy_to_db(word):
    global insertlist
    wordCollection = db['words']
    occurrence = wordCollection.find_one({'word': word})
    if occurrence:
        n = occurrence['number']
        n = n + 1
        wordCollection.update({'word': word}, {'$set': {'number': n}})
    else:
        insertlist.append({'word': word, 'number': 1})
        #wordCollection.insert({'word': word, 'number': 1})
        if len(insertlist) >= 5000:
            print("insert triggered ... ")
            wordCollection.insert(insertlist)
            insertlist = []
I call this function for every word.
Sounds like you could use upserts. If you use upserts you don't need to do that fetch/save cycle.
I'm not sure how this is done in the Python driver, but in JavaScript it would look something like:
db.words.update({"_id": "the_word" }, {"$inc": {"frequency": 1}}, true)
MongoDB creates an index for the _id field automatically. If you are not using the _id field for your word, then creating an index for your key would most probably help a lot.
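In PyMongo that might look roughly like this (update_one is the PyMongo 3 spelling; older drivers use update(..., upsert=True)):
words = db['words']
words.create_index('word', unique=True)  # not needed if you use the word itself as _id

def count_word(word):
    # one round trip: increments the counter if the document exists, creates it otherwise
    words.update_one({'word': word}, {'$inc': {'number': 1}}, upsert=True)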
Edit: some more ideas for you
As there is a lot of data, you could use the _id field for your word. This way you wouldn't need to create another index and the updates would be slightly faster as only one index needs to be updated while inserting new documents. This is in case inserting speed is the bottleneck.
While taking advantage of batch inserts is generally a good idea when inserting a lot of data, I'm not sure if it helps too much on this case. This depends on your data. If the ratio of unique words is high, then batch inserts might be handy. But if the same words are used over and over again (which I guess is the case with most languages), then batch inserts might not be of too much help.
Also, it looks like you have a problem in your batch insert. Think about encountering a word for the first time: it is inserted into your insertlist. Now, if this same word is encountered again before the previous batch has been inserted, the number attribute for this word will stay at 1, which is incorrect.
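One way to sidestep that bug is to aggregate counts locally and flush them with $inc upserts, so a word seen several times before the first flush is still counted correctly; a sketch, assuming the db handle from the question and a recent PyMongo:
from collections import defaultdict

pending = defaultdict(int)  # word -> occurrences seen since the last flush

def count_word(word, flush_every=5000):
    pending[word] += 1
    if len(pending) >= flush_every:
        flush()

def flush():
    words = db['words']
    for word, n in pending.items():
        words.update_one({'word': word}, {'$inc': {'number': n}}, upsert=True)
    pending.clear()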
Are you sure the db is the bottleneck? Have you already made sure that there is no other poorly performing code? But anyway, I guess inserting 4 GB of data will take a while in any case.

Working with suffix trees in python

I'm relatively new to python and am starting to work with suffix trees. I can build them, but I'm running into a memory issue when the string gets large. I know that they can be used to work with DNA strings of size 4^10 or 4^12, but whenever I try to implement a method, I end up with a memory issue.
Here is my code for generating the string and the suffix tree.
import random

def get_string(length):
    string = ""
    for i in range(length):
        string += random.choice("ATGC")
    return string

word = get_string(4**4) + "$"

def suffixtree(string):
    for i in xrange(len(string)):
        if tree.has_key(string[i]):
            tree[string[i]].append([string[i+1:]][0])
        else:
            tree[string[i]] = [string[i+1:]]
    return tree

tree = {}
suffixtree(word)
When I get up to around 4**8, I run into severe memory problems. I'm rather new to this so I'm sure I'm missing something with storing these things. Any advice would be greatly appreciated.
As a note: I want to do string searching to look for matching strings in a very large string. The search string match size is 16. So, this would look for a string of size 16 within a large string, and then move onto the next string and perform another search. Since I'll be doing a very large number of searches, a suffix tree was suggested.
Many thanks
This doesn't look like a tree to me. It looks like you are generating all possible suffixes, and storing them in a hashtable.
You will likely get much smaller memory performance if you use an actual tree. I suggest using a library implementation.
As others have said already, the data structure you are building is not a suffix tree. However, the memory issues stem largely from the fact that your data structure involves a lot of explicit string copies. A call like this
string[i+1:]
creates an actual (deep) copy of the substring starting at i+1.
If you are still interested in constructing your original data structure (whatever its use may be), a good solution is to use buffers instead of string copies. Your algorithm would then look like this:
def suffixtree(string):
    N = len(string)
    for i in xrange(N):
        if tree.has_key(string[i]):
            tree[string[i]].append(buffer(string, i+1, N))
        else:
            tree[string[i]] = [buffer(string, i+1, N)]
    return tree
I tried this embedded in the rest of your code, and confirmed that it requires significantly less than 1 GB of main memory even at a total length of 8^11 characters.
Note that this will likely be relevant even if you switch to an actual suffix tree. A correct suffix tree implementation will not store copies (not even buffers) in the tree edges; however, during tree construction you might need a lot of temporary copies of the strings. Using the buffer type for these is a very good idea to avoid putting a heavy burden on the garbage collector for all the unnecessary explicit string copies.
The reason you get memory problems is that for input 'banana' you are generating {'b': ['anana$'], 'a': ['nana$', 'na$', '$'], 'n': ['ana$', 'a$']}. That isn't a tree structure. You have every possible suffix of the input created and stored in one of the lists. That takes O(n^2) storage space. Also, for a suffix tree to work properly, you want the leaf nodes to give you index positions.
The result you want to get is {'banana$': 0, 'a': {'$': 5, 'na': {'$': 3, 'na$': 1}}, 'na': {'$': 4, 'na$': 2}}. (This is an optimized representation; a simpler approach limits us to single-character labels.)
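For comparison, here is a minimal uncompressed suffix trie that records start positions at its leaves. It still uses roughly O(n^2) space (a real suffix tree collapses single-child chains into labelled edges to get down to O(n)), so it is only meant to illustrate the leaf-index idea:
def suffix_trie(text):
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[None] = start  # leaf marker: where this suffix begins
    return root

def find_all(trie, pattern):
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    out, stack = [], [node]  # collect every leaf index below this node
    while stack:
        n = stack.pop()
        for k, v in n.items():
            if k is None:
                out.append(v)
            else:
                stack.append(v)
    return out

trie = suffix_trie("banana$")
print(find_all(trie, "ana"))  # [3, 1]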
If your memory problems lie in creating the suffix tree, are you sure you need one? You could find all matches in a single string like this:
word = get_string(4**12) + "$"

def matcher(word, match_string):
    positions = [-1]
    while 1:
        positions.append(word.find(match_string, positions[-1] + 1))
        if positions[-1] == -1:
            return positions[1:-1]
print matcher(word,'AAAAAAAAAAAA')
[13331731, 13331732, 13331733]
print matcher('AACTATAAATTTACCA','AT')
[4, 8]
My machine is pretty old, and this took 30 seconds to run with the 4^12 string. I used a 12-character target so there would be some matches. Also, this solution will find overlapping results, should there be any.
Here is a suffix tree module you could try, like this:
import suffixtree
stree = suffixtree.SuffixTree(word)
print stree.find_substring("AAAAAAAAAAAA")
Unfortunately, my machine is too slow to test this out properly with long strings. But presumably once the suffix tree is built the searches will be very fast, so for large numbers of searches it should be a good call. Further, find_substring only returns the first match (I don't know if this is an issue; I'm sure you could adapt it easily).
Update: Split the string into smaller suffix trees, thus avoiding memory problems
So if you need to do 10 million searches on a 4^12 length string, we clearly do not want to wait for 9.5 years (the standard simple search I first suggested, on my slow machine...). However, we can still use suffix trees (thus being a lot quicker), AND avoid the memory issues. Split the large string into manageable chunks (which we know the machine's memory can cope with), turn a chunk into a suffix tree, search it 10 million times, then discard that chunk and move onto the next one. We also need to remember to search the overlap between each chunk. I wrote some code to do this (it assumes the length of the large string to be searched, word, is a multiple of our maximum manageable string length, max_length; you'll have to adjust the code to also check the remainder at the end if this is not the case):
def split_find(word, search_words, max_length):
    number_sub_trees = len(word)/max_length
    matches = {}
    for i in xrange(0, number_sub_trees):
        stree = suffixtree.SuffixTree(word[max_length*i:max_length*(i+1)])
        for search in search_words:
            if search not in matches:
                match = stree.find_substring(search)
                if match > -1:
                    matches[search] = match + max_length*i, i
                if i < number_sub_trees:
                    match = word[max_length*(i+1) - len(search):max_length*(i+1) + len(search)].find(search)
                    if match > -1:
                        matches[search] = match + max_length*i, i
    return matches

word = get_string(4**12)
search_words = ['AAAAAAAAAAAAAAAA']  # list of all words to find matches for
max_length = 4**10  # as large as your machine can cope with (multiple of word)
print split_find(word, search_words, max_length)
In this example I limit the max suffix tree length to length 4^10, which needs about 700MB.
Using this code, for one 4^12 length string, 10 million searches should take around 13 hours (full searches, with zero matches, so if there are matches it will be quicker). However, as part of this we need to build 100 suffix trees, which will take around 100 * 41 sec ≈ 1 hour.
So the total time to run is around 14 hours, without memory issues... Big improvement on 9.5 years.
Note that I am running this on a 1.6GHz CPU with 1GB RAM, so you ought to be able to do way better than this!
