I am a programming teacher, and I would like to write a script that detects the amount of repetition in a C/C++/Python file. I guess I can treat any file as pure text.
The script's output would be the number of similar sequences that repeat. Eventually, I am only interested in a DRY metric (how well the code satisfies the DRY principle).
Naively, I tried a simple autocorrelation, but it was hard to find a proper threshold.
import numpy as np
import matplotlib.pyplot as plt

u = open("find.c").read()
v = [ord(x) for x in u]                 # treat the file as a sequence of byte values
y = np.correlate(v, v, mode="same")     # autocorrelation of the text with itself
y = y[: len(y) // 2]
x = range(len(y))
z = np.polyval(np.polyfit(x, y, 3), x)  # fit and subtract the slowly varying trend
f = (y - z)[:-5]
plt.plot(f)
plt.show()
So I am looking at different strategies... I also tried to compare the similarities between each line, each group of 2 lines, each group of 3 lines ...
import difflib
import numpy as np

lines = open("b.txt").readlines()
lines = [line.strip() for line in lines]

n = 3
d = []
for i in range(len(lines)):
    a = lines[i:i+n]
    for j in range(len(lines)):
        b = lines[j:j+n]
        if i == j: continue  # skip same line
        group_size = np.sum([len(x) for x in a])
        if group_size < 5: continue  # skip short lines
        ratio = 0
        for u, v in zip(a, b):
            r = difflib.SequenceMatcher(None, u, v).ratio()
            ratio += r if r > 0.7 else 0
        d.append(ratio)

dry = sum(d) / len(lines)
In the following, we can identify some repetition at a glance:
w = int(len(d) / 100)
e = np.convolve(d, np.ones(w), "valid") / w * 10
plt.plot(range(len(d)), d, range(len(e)), e)
plt.show()
Why not use:
d = np.exp(np.array(d))
Thus, the difflib module looks promising: SequenceMatcher does some magic (Levenshtein?), but I would need some magic constants as well (the 0.7)... However, this code is worse than O(n^2) and runs very slowly on long files.
What is funny is that the amount of repetition is quite easily identified with an attentive eye (sorry to this student for having taken his code as a good bad example).
I am sure there is a more clever solution out there.
Any hint?
I would build a system based on compressibility, because that is essentially what repetition means to a compressor. Modern compression algorithms are already looking for ways to reduce repetition, so let's piggyback on that work.
Things that are similar compress well under any reasonable compression algorithm, e.g. LZ. Under the hood, the compressed stream is essentially the text plus references back into itself, which you might be able to pull out.
Write a program that feeds lines [0:n] into the compression algorithm and compares the output length with that of lines [0:n+1].
When the compressed output grows by much less than the input you just added, note down that you potentially have a DRY candidate at that location; moreover, if you can figure out the format, you can see which earlier text it was deemed similar to.
If you can figure out the compression format, you don't need to rely on the "size doesn't grow as much" heuristic; you can just pull out the references directly.
If needed, you can find similar structures with different names by pre-processing the input, for instance by normalizing the names. However, I foresee this getting a bit messy, so it's a v2 feature. Pre-processing can also be used to normalize the formatting.
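A minimal sketch of the incremental-compression idea, assuming zlib as the compressor; the file name and the growth threshold are placeholders, and recompressing the whole prefix on every line makes this quadratic as written:

import zlib

lines = open("student.c").read().splitlines(keepends=True)
prev = 0
for i in range(1, len(lines) + 1):
    size = len(zlib.compress("".join(lines[:i]).encode(), 9))  # compressed size of the prefix
    growth = size - prev
    prev = size
    # arbitrary threshold: the new line barely grew the archive, so it is probably a repeat
    if i > 1 and growth < len(lines[i - 1]) // 3:
        print(f"line {i} looks repetitive (archive grew {growth} bytes for {len(lines[i - 1])} chars)")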
Looks like you're choosing a long path. I wouldn't go there.
I would look into minifying the code before analyzing it, to completely remove any influence of variable names, extra spacing, formatting and even slight logic reshuffling.
Another approach would be comparing the students' byte-code. But it may not be a very good idea, since the result will likely need additional clean-up.
The dis module would be an interesting option.
I would, most likely, settle on comparing their ASTs. But the AST is likely to give false positives for short functions, because their structure may be too similar, so consider checking short functions with something else, something trivial.
On top of that, I would consider using Levenshtein distance or something similar to numerically calculate the differences between the students' byte-code/sources/AST/dis output. This would be what? Almost O(N^2)? Shouldn't matter.
Or, if needed, make it more complex and calculate the distance between each function of student A and each function of student B, highlighting cases where the distance is too short. It may not be needed, though.
With simplification and normalization of the input, more algorithms should start returning good results. If a student is good enough to take someone's code and reshuffle not only the variables, but the logic, and maybe even improve the algorithm, then this student understands the code well enough to defend it and use it with no help in the future. I guess that's the kind of help a teacher would want to see exchanged between students.
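For Python submissions, a rough sketch of the normalize-then-compare idea (the file names are placeholders, and this only renames plain variables, not attributes, function names or literals):

import ast
import difflib

class Normalize(ast.NodeTransformer):
    """Replace every variable name with a positional placeholder (v0, v1, ...)."""
    def __init__(self):
        self.names = {}
    def visit_Name(self, node):
        node.id = self.names.setdefault(node.id, "v%d" % len(self.names))
        return node

def canonical(source):
    # parse, rename identifiers, and dump the AST back to a comparable string
    return ast.dump(Normalize().visit(ast.parse(source)))

a = canonical(open("student_a.py").read())
b = canonical(open("student_b.py").read())
print(difflib.SequenceMatcher(None, a, b).ratio())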
You can treat this as a variant of the longest common subsequence problem between the input and itself, where the trivial matching of each element with itself is disallowed. This retains the optimal substructure of the standard algorithm, since it can be phrased as a non-transitive “equality” and the algorithm never relies on transitivity.
As such, we can write this trivial implementation:
import operator

class Repeat:
    def __init__(self, l):
        self.l = list(l)
        self.memo = {}

    def __call__(self, m, n):
        l = self.l
        memo = self.memo
        k = m, n
        ret = memo.get(k)
        if not ret:
            if not m or not n:
                ret = 0, None
            elif m != n and l[m-1] == l[n-1]:  # critical change here!
                z, tail = self(m-1, n-1)
                ret = z+1, ((m-1, n-1), tail)
            else:
                ret = max(self(m-1, n), self(m, n-1), key=operator.itemgetter(0))
            memo[k] = ret
        return ret

    def go(self):
        n = len(self.l)
        v = self(n, n)[1]
        ret = []
        while v:
            x, v = v
            ret.append(x)
        ret.reverse()
        return ret

def repeat(l):
    return Repeat(l).go()
You might want to canonicalize lines of code by removing whitespace (except perhaps between letters), removing comments, and/or replacing each unique identifier with a standardized label. You might also want to omit trivial lines like } in C/C++ to reduce noise. Finally, by symmetry, only cases with, say, m >= n need to be considered.
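For reference, a small usage sketch along those lines (the file name and the crude DRY ratio at the end are placeholders; note the memoized recursion is O(n^2) in time and memory and goes roughly 2*len(lines) deep, so the recursion limit may need raising for long files):

import re
import sys

sys.setrecursionlimit(10000)  # the memoized recursion goes roughly 2 * len(lines) deep

raw = open("student.c").readlines()
lines = [re.sub(r"\s+", "", l) for l in raw]            # canonicalize: drop all whitespace
lines = [l for l in lines if l not in ("", "}", "};")]  # drop empty and trivial lines

matches = repeat(lines)              # list of (i, j) pairs of matching line indices
dry = 1 - len(matches) / len(lines)  # crude score: 1.0 means no repetition detected
print(len(matches), "repeated line pairs, DRY score", round(dry, 2))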
Of course, there are also "real" answers and real research on this issue!
Frame challenge: I’m not sure you should do this
It’d be a fun programming challenge for yourself, but if you intend to use it as a teaching tool, I’m not sure I would. There’s not a good definition of “repeat” from the DRY principle that would be easy to test for fully in a computer program. The human definition, which I’d say is basically “failure to properly abstract your code at an appropriate level, manifested via some type of repetition of code, whether repeating exact blocks, repeating the same idea over and over again, or somewhere in between”, isn’t something I think anyone will be able to get working well enough at this time to use as a tool that teaches good habits with respect to DRY without confusing the student or teaching bad habits too. For now I’d argue this is a job for humans because it’s easy for us and hard for computers, at least for now…
That said, if you want to give it a try, first define for yourself requirements for what errors you want to catch, what they’ll look like, and what good code looks like; then define acceptable false positive and false negative rates and test your code on a wide variety of representative inputs, validating it against human judgement to see if it performs well enough for your intended use. But I’m guessing you’re really looking for more than simple repetition of tokens, and if you want to have a chance at succeeding I think you need to clearly define what you’re looking for and how you’ll measure success, and then validate your code. A teaching tool can do great harm if it doesn’t actually teach the correct lesson: for example, if it simply encourages students to obfuscate their code so it doesn’t get flagged as violating DRY, if it doesn’t flag bad code so the student assumes it’s ok, or if it flags code that is actually very well written.
More specifically, what types of repetition are ok and what aren’t? Is it good or bad to use “if” or “for” or other syntax repeatedly in code? Is it ok for variables and functions/methods to have names with common substrings (e.g. average_age, average_salary, etc.?). How many times is repetition ok before abstraction should happen, and when it does what kind of abstraction is needed and at what level (e.g. a simple method, or a functor, or a whole other class, or a whole other module?). Is more abstraction always better or is perfect sometimes the enemy of on time on budget? This is a really interesting problem, but it’s also a very hard problem, and honestly I think a research problem, which is the reason for my frame challenge.
Edit:
Or, if you definitely want to try this anyway, you can make it a teaching tool, though not necessarily in the way you intended: by showing your students your adherence to DRY in the code you write when creating your tool, and by introducing them to the nuances of DRY and the shortcomings of automated code quality assessment by being transparent with them about the limitations of your quality assessment tool. What I wouldn’t do is use it like some professors use plagiarism detection tools, as a digital oracle whose assessment of the quality of the students’ code is unquestioned. That approach is likely to cause more harm than good toward the students.
I suggest the following approach: let's say that repetitions should be at least 3 lines long. Then we hash every group of 3 consecutive lines. If a hash repeats, we write down the line number where it occurred. All that is left is to join adjacent duplicated line numbers together to get longer sequences.
For example, if you have duplicate blocks on lines 100-103 and 200-203, you will get {HASH1: (100, 200), HASH2: (101, 201)} (lines 100-102 and 200-202 produce the same value HASH1, and HASH2 covers lines 101-103 and 201-203). When you join the results, it will produce the sequence (100, 101, 200, 201). Finding the monotonic runs in that sequence, you get ((100, 101), (200, 201)).
As there are no nested comparisons, the time complexity is linear: hashing and dictionary insertion are O(1) per line, O(n) overall.
Algorithm:
read text line by line
transform it by
removing blanks
removing empty lines, saving mapping to original for the future
for each 3 transformed lines, join them and calculate hash on it
filter lines whose hashes occur more than once (these are repetitions of at least 3 lines)
find longest sequences and present repetitive text
Code:
from itertools import groupby, cycle
import re

def sequences(l):
    # split a sorted list of line numbers into runs of consecutive numbers
    x2 = cycle(l)
    next(x2)
    grps = groupby(l, key=lambda j: j + 1 == next(x2))
    yield from (tuple(v) + (next((next(grps)[1])),) for k, v in grps if k)

with open('program.cpp') as fp:
    text = fp.readlines()

# remove white spaces
processed, text_map = [], {}
proc_ix = 0
for ix, line in enumerate(text):
    line = re.sub(r"\s+", "", line, flags=re.UNICODE)
    if line:
        processed.append(line)
        text_map[proc_ix] = ix
        proc_ix += 1

# calc hashes
hashes, hpos = [], {}
for ix in range(len(processed) - 2):
    h = hash(''.join(processed[ix:ix+3]))  # join 3 lines
    hashes.append(h)
    hpos.setdefault(h, []).append(ix)  # this list will reflect lines that are duplicated

# filter duplicated three liners
seqs = []
for k, v in hpos.items():
    if len(v) > 1:
        seqs.extend(v)
seqs = sorted(list(set(seqs)))

# find longer sequences
result = {}
for seq in sequences(seqs):
    result.setdefault(hashes[seq[0]], []).append((text_map[seq[0]], text_map[seq[-1] + 3]))

print('Duplicates found:')
for v in result.values():
    print('-' * 20)
    vbeg, vend = v[0]
    print(''.join(text[vbeg:vend]))
    print(f'Found {len(v)} duplicates, lines')
    for line_numbers in v:
        print(f'{1 + line_numbers[0]} : {line_numbers[1]}')
Related
I have recently started playing around with Bloom filters, and I have a use case for which this calculator suggests using 15 different and unrelated hash functions.
Hashing a string 15 times would be quite compute-intensive, so I started looking for a better solution.
Gladly, I encountered the Kirsch-Mitzenmacher optimization, which suggests the following: hash_i = hash1 + i * hash2, 0 ≤ i ≤ k - 1, where k is the number of suggested hash functions.
In my specific case, I am trying to create a 0.5 MiB Bloom filter that should eventually be able to efficiently store 200,000 (200k) pieces of information. With 15 hash functions, this would result in p = 0.000042214, or 1 in 23,689 (23k).
I tried implementing the optimization in python as follows:
import uuid
import hashlib
import mmh3

m = 4194304  # 0.5 MiB in bits
k = 15

bit_vector = [0] * m

data = str(uuid.uuid4()).encode()

hash1 = int(hashlib.sha256(data).hexdigest(), base=16)
hash2 = mmh3.hash(data)

for _ in range(0, k - 1):
    hash1 += hash2
    bit = hash1 % m
    bit_vector[bit] = 1
But it seems to get far worse results when compared to creating a classic Bloom Filter with only two hashes (it performs 10x worse in terms of false positives).
What am I doing wrong here? Am I completely misinterpreting the optimization, hence using the formula in the wrong way, or am I just using the wrong "parts" of the hashes (here I am using the full hashes in the calculation; maybe I should be using the first 32 bits of one and the last 32 bits of the other)? Or maybe the problem is me encoding the data before hashing it?
I also found this git repository that talks about Bloom Filters and the optimization, and their results when using it are quite good both in terms of computing time and in false positives quantity.
*In the example I am using python, as that's the programming language I'm most comfortable with to test things out, but in my project I am actually going to use JavaScript to perform the whole process. Help in any programming language of your choice is much appreciated regardless.
The Kirsch-Mitzenmacher optimization is a proof-of-concept paper, and as such it assumes a number of requirements on the table size and the hash functions themselves. Using it naively has tripped up people before you too. There is a bunch of practical stuff to consider. You can check out section 6.5.1, "Double hashing", of P. C. Dillinger's thesis, linked in his long comment as well, in which he explains the issues and offers solutions. His solutions to the rocksdb implementation issues can also be interesting.
Maybe start by trying the "enhanced double hashing":
hashes = []
for i in range(0, k - 1):
    hash2 += i
    hashes.append(hash1)
    hash1 += hash2
(I don't know why your example code only generates 1 hash as result, and assume you pasted simplified code. 15 hash functions should set 15 bits.)
If this doesn't work, and you are tired of reading/trying more solutions, say "it's not worth it" at this point and choose to find a library with a good implementation instead; compare the candidates for speed/accuracy. Unfortunately, I don't know any to recommend. Maybe the rocksdb implementation, since we know an expert has worked on it?
And even that may not be worth the effort; calling mmh3.hash(data, i) with 15 different seeds may be reasonably fast enough.
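For what it's worth, a sketch of that last fallback, reusing m and k from the question (the bit-packing and the might_contain helper are just illustrative choices):

import mmh3

m, k = 4194304, 15
bit_vector = bytearray(m // 8)

def add(data):
    for seed in range(k):                # one independently seeded hash per "function"
        bit = mmh3.hash(data, seed) % m  # mmh3.hash returns a signed 32-bit int; % m keeps it in range
        bit_vector[bit // 8] |= 1 << (bit % 8)

def might_contain(data):
    return all((bit_vector[b // 8] >> (b % 8)) & 1
               for b in (mmh3.hash(data, seed) % m for seed in range(k)))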
You are indeed misinterpreting the Kirsch-Mitzenmacher optimization. You are combining the hash1 and hash2 functions, plus all your k functions, into what amounts to a single, more complicated hash function.
The point of the optimization is to use each of these as separate hash functions. You don't merge the k bit results together in any way, you set each of them separately in your bit vector. In other words, you should have something roughly along the lines of:
bit_vector[hash1 % m] = 1
bit_vector[hash2 % m] = 1
for i in range(0, k - 1):
    bit_vector[(hash1 + i * hash2) % m] = 1
See how I'm setting a bit for every hash function? You're just combining them all. That ruins the whole point of using multiple different "independent" hash functions, which is really necessary for getting good results from a Bloom filter.
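To make that concrete, here is a runnable sketch reusing the question's two base hashes (it glosses over details the Dillinger thesis discusses, such as making sure hash2 is odd/non-zero relative to a power-of-two m):

import hashlib
import mmh3

m, k = 4194304, 15
bits = [0] * m

def indexes(data):
    h1 = int(hashlib.sha256(data).hexdigest(), 16)
    h2 = mmh3.hash(data)
    return [(h1 + i * h2) % m for i in range(k)]  # k distinct positions, one per "hash function"

def add(data):
    for ix in indexes(data):
        bits[ix] = 1

def might_contain(data):
    return all(bits[ix] for ix in indexes(data))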
I have a collection of a few thousand strings (DNA sequences). I want to trim this down to a couple of hundred (exact number not critical) by excluding sequences that are very similar.
I can do this via matching using the "Levenshtein" module. It works, but it's pretty slow, and I'm pretty sure there must be a much faster way. The code here is the same approach but applied to words, to make it more testable; for me with this cutoff it takes about 10 seconds and collects ~1000 words.
import Levenshtein as lev
import random

f = open("/usr/share/dict/words", 'r')
txt = f.read().splitlines()
f.close()

cutoff = .5
collected = []
while len(txt) > 0:
    a = random.choice(txt)
    collected.append(a)
    # keep only the entries that are NOT similar to the one just picked
    txt = list(filter(lambda b: lev.ratio(a, b) < cutoff, txt))  # list() needed on Python 3
I've tried a few different variations and some other matching modules (Jellyfish) without getting significantly faster.
Maybe you could use locality-sensitive hashing. You could hash all strings into buckets so that all strings in one bucket are very similar to each other. Then you can pick just one from each bucket.
I've never used this personally (I don't know how well it works in your case), it's just an idea.
related: Python digest/hash for string similarity
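A very rough sketch of the idea with a cheap MinHash-style signature over k-mers (the shingle size, the number of salts, and the use of Python's built-in hash are arbitrary; bucketing on the full signature is stricter than a banded LSH, so it mostly groups near-identical sequences):

import random

K = 6         # shingle (k-mer) length, arbitrary
N_HASHES = 8  # signature length, arbitrary
random.seed(0)
SALTS = [random.getrandbits(64) for _ in range(N_HASHES)]

def signature(seq):
    grams = {seq[i:i + K] for i in range(len(seq) - K + 1)} or {seq}  # guard very short strings
    return tuple(min(hash(g) ^ s for g in grams) for s in SALTS)

def representatives(seqs):
    buckets = {}
    for s in seqs:
        buckets.setdefault(signature(s), []).append(s)  # near-identical strings tend to collide
    return [group[0] for group in buckets.values()]     # keep one per bucket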
I'm not quite sure what your application aims at, but when working with sequences you might want to consider using Biopython. An alternative way to calculate the distance between two DNA snippets is using an alignment score (which is related to Levenshtein by non-constant alignment weights). Using Biopython you could do a multiple sequence alignment and create a phylogenetic tree. If you'd like a faster solution, use BLAST, which is a heuristic approach. These would be consistent solutions, while yours depends on the random choice of the input sequences.
Concerning your initial question, there's no 'simple' solution to speed up your code.
I want to get the most relevant word from enchant's suggest(). Is there any better way to do that? I feel my function is not efficient when it comes to checking a large set of words, in the range of 100k or more.
Problem with enchant suggest():
>>> import enchant
>>> d = enchant.Dict("en_US")  # dictionary creation not shown in the original; "en_US" assumed
>>> d.suggest("prfomnc")
['prominence', 'performance', 'preform', 'Provence', 'preferment', 'proforma']
My function to get the appropriate word from a set of suggested words:
import enchant, difflib

d = enchant.Dict("en_US")  # dictionary creation not shown in the original; "en_US" assumed
word = "prfomnc"
dict, max = {}, 0
a = set(d.suggest(word))
for b in a:
    tmp = difflib.SequenceMatcher(None, word, b).ratio()
    dict[tmp] = b
    if tmp > max:
        max = tmp
print dict[max]
Result: performance
Updated:
If I get multiple keys, meaning the same difflib ratio() values, I use a dictionary with multiple values per key, as explained here: http://code.activestate.com/recipes/440502-a-dictionary-with-multiple-values-for-each-key/
No magic bullet, I'm afraid... a few suggestions however.
I'm guessing that most of the time in the logic is spent in difflib's SequenceMatcher().ratio() call. This wouldn't be surprising, since this method uses a variation on the Ratcliff-Obershelp algorithm which is relatively expensive, CPU-wise (but the metric it produces is rather "on the mark" for locating close matches, and that is probably why you like it).
To be sure, you should profile this logic and confirm that SequenceMatcher() is indeed the hot spot. Maybe Enchant.suggest() is also a bit slow, but there would be little we could do, code-wise, to improve this (configuration-wise, there may be a few options, e.g. doing away with the personal dictionary to save the double look-up and merge, etc.).
Assuming that SequenceMatcher() is indeed the culprit, and assuming that you wish to stick with the Ratcliff-Obershelp similarity metric as the way to select the best match, you could do [some of] the following:
only compute the SequenceMatcher ratio value for the top (?) 5 items from Enchant. After all, Enchant.suggest() returns its suggestions in an ordered fashion, with its best guesses first; therefore, while based on different heuristics, there is value in the Enchant order as well: the chances of finding a high-ranking match probably diminish as we move down the list. Also, even though we may end up ignoring a few such high-ranking matches, by testing only the top few Enchant suggestions we somewhat combine the "wisdom" found in Enchant's heuristics with that of the Ratcliff-Obershelp metric.
stop computing the SequenceMatcher ratio after a certain threshold has been reached
The idea is similar to the preceding: avoid calling SequenceMatcher once the odds of finding better are getting smaller (and once we have a decent if not best choice in hand)
filter out some of the words from Enchant with your own logic.
The idea is to have a relatively quick/inexpensive test which may tell us that a given word is unlikely to score well on the SequenceMatcher ratio. For example, exclude words that do not have at least, say, the length of the user's string minus two characters in common.
BTW, you can maybe use some of the SequenceMatcher object's [quicker] functions to get some data for the filtering heuristics.
use the SequenceMatcher quick_ratio() function instead
at least in some cases.
only keep the best match, in a string, rather than using a dictionary
Apparently only the top choice matters, so except for test purposes you may not need the [relatively small] overhead of the dictionary.
you may consider writing your own Ratcliff-Obershelp (or similar) method, introducing therein various early exits when the prospect of beating the current max ratio is small. BEWARE, it would likely be difficult to produce a method as efficient as difflib's C-language implementation; your interest in doing this lies in the early exits...
HTH, good luck ;-)
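Putting the first two suggestions together (top-N plus an early exit), a sketch might look like this; top_n, good_enough and the "en_US" dictionary are arbitrary illustrative choices:

import difflib
import enchant

d = enchant.Dict("en_US")

def best_suggestion(word, top_n=5, good_enough=0.9):
    best, best_ratio = None, 0.0
    for cand in d.suggest(word)[:top_n]:  # trust Enchant's own ranking: look at the top few only
        r = difflib.SequenceMatcher(None, word, cand).ratio()
        if r > best_ratio:
            best, best_ratio = cand, r
        if best_ratio >= good_enough:     # early exit once a match is "good enough"
            break
    return best

print(best_suggestion("prfomnc"))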
You don't actually need to keep a dict if you are only interested in the best matches:
>>> word="prfomnc"
>>> best_words = []
>>> best_ratio = 0
>>> a = set(d.suggest(word))
>>> for b in a:
... tmp = difflib.SequenceMatcher(None, word, b).ratio()
... if tmp > best_ratio:
... best_words = [b]
... best_ratio = tmp
... elif tmp == best_ratio:
... best_words.append(b)
...
>>> best_words
['performance']
I'm writing a program in Python to do a unigram (and eventually bigram, etc.) analysis of movie reviews. The goal is to create feature vectors to feed into libsvm. I have 50,000-odd unique words in my feature vector (which seems rather large to me, but I am relatively sure I'm right about that).
I'm using the Python dictionary implementation as a hashtable to keep track of new words as I meet them, but I'm noticing an enormous slowdown after the first 1,000-odd documents are processed. Would I have better efficiency (given the distribution of natural language) if I used several smaller hashtables/dictionaries, or would it be the same/worse?
More info:
The data is split into 1500 or so documents, 500-ish words each. There are between 100 and 300 unique words (with respect to all previous documents) in each document.
My current code:
#processes each individual file, tok == filename, v == predefined class
def processtok(tok, v):
    #n is the number of unique words so far,
    #reference is the mapping reference in case I want to add new data later
    #hash is the hashtable
    #statlist is the massive feature vector I'm trying to build
    global n
    global reference
    global hash
    global statlist

    cin = open(tok, 'r')
    statlist = [0]*43990
    statlist[0] = v
    lines = cin.readlines()
    for l in lines:
        line = l.split(" ")
        for word in line:
            if word in hash.keys():
                if statlist[hash[word]] == 0:
                    statlist[hash[word]] = 1
                else:
                    hash[word]=n
                    n+=1
                    ref.write('['+str(word)+','+str(n)+']'+'\n')
                    statlist[hash[word]] = 1
    cin.close()
    return statlist
Also keep in mind that my input data is about 6mb and my output data is about 300mb. I'm simply startled at how long this takes, and I feel that it shouldn't be slowing down so dramatically as it's running.
Slowing down: the first 50 documents take about 5 seconds, the last 50 take about 5 minutes.
@ThatGuy has made the fix, but hasn't actually told you this:
The major cause of your slowdown is the line
if word in hash.keys():
which laboriously makes a list of all the keys so far, then laboriously searches that list for 'word'. The time taken is proportional to the number of keys, i.e. the number of unique words found so far. That's why it starts fast and becomes slower and slower.
All you need is if word in hash: which in 99.9999999% of cases takes time independent of the number of keys -- one of the major reasons for having a dict.
The faffing about with statlist[hash[word]] doesn't help, either. By the way, the fixed size in statlist=[0]*43990 needs explanation.
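A quick way to see the difference for yourself (numbers will vary; on Python 2, .keys() builds a full list on every membership test, while on Python 3 it is a view and the gap is much smaller):

import timeit

setup = "h = dict((str(i), i) for i in range(100000)); w = '99999'"
print(timeit.timeit("w in h.keys()", setup, number=1000))  # on Python 2: builds and scans a list each call
print(timeit.timeit("w in h", setup, number=1000))         # direct constant-time dict lookup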
More problems
Problem A: Either (1) your code suffered from indentation distortion when you published it, or (2) hash will never be updated by that function. Quite simply, if word is not in hash, i.e. it's the first time you've seen it, absolutely nothing happens. The hash[word] = n statement (the ONLY code that updates hash) is NOT executed. So no word will ever be in hash.
It looks like this block of code needs to be shifted left 4 columns, so that it's aligned with the outer if:
else:
    hash[word]=n
    ref.write('['+str(word)+','+str(n)+']'+'\n')
    statlist[hash[word]] = 1
Problem B: There is no code at all to update n (allegedly the number of unique words so far).
I strongly suggest that you take as many of the suggestions that @ThatGuy and I have made as you care to, rip out all the global stuff, fix up your code, chuck in a few print statements at salient points, and run it over, say, 2 documents each of 3 lines with about 4 words per line. Ensure that it is working properly. THEN run it on your big data set (with the prints suppressed). In any case you may want to put out stats (like number of documents, lines, words, unique words, elapsed time, etc.) at regular intervals.
Another problem
Problem C: I mentioned this in a comment on @ThatGuy's answer, and he agreed with me, but you haven't mentioned taking it up:
>>> line = "foo bar foo\n"
>>> line.split(" ")
['foo', 'bar', 'foo\n']
>>> line.split()
['foo', 'bar', 'foo']
>>>
Your use of .split(" ") will lead to spurious "words" and distort your statistics, including the number of unique words that you have. You may well find the need to change that hard-coded magic number.
I say again: there is no code that updates n in the function. Doing hash[word] = n seems very strange, even if n is updated for each document.
I don't think Python's dictionary has anything to do with your slowdown here, especially when you are saying that the entries are around 100. I am hoping that you are referring to insertion and retrieval, which are both O(1) in a dictionary. The problem could be that you are not using iterators (or loading key/value pairs one at a time) when creating the dictionary, and that you are loading all the words into memory. In that case, the slowdown is due to memory consumption.
I think you've got a few problems going on here. Mostly, I am unsure of what you are trying to accomplish with statlist. It seems to me like it is serving as a poor duplicate of your dictionary. Create it after you have found all of your words.
Here is my guess as to what you want:
def processtok(tok, v):
    global n
    global reference
    global hash
    cin = open(tok, 'rb')
    for l in cin:
        line = l.split(" ")
        for word in line:
            if word in hash:
                hash[word] += 1
            else:
                hash[word] = 1
                n += 1
                ref.write('['+str(word)+','+str(n)+']'+'\n')
    cin.close()
    return hash
Note that this means you no longer need n, as you can discover the count by doing len(hash).
Was coding something in Python. Have a piece of code, wanted to know if it can be done more elegantly...
# Statistics format is - done|remaining|200's|404's|size
statf = open(STATS_FILE, 'r').read()
starf = statf.strip().split('|')
done = int(starf[0])
rema = int(starf[1])
succ = int(starf[2])
fails = int(starf[3])
size = int(starf[4])
...
This goes on. I wanted to know if, after splitting the line into a list, there is any better way to assign each item to a variable. I have close to 30 lines assigning index values to variables. Just trying to learn more about Python, that's it...
done, rema, succ, fails, size, ... = [int(x) for x in starf]
Better:
labels = ("done", "rema", "succ", "fails", "size")
data = dict(zip(labels, [int(x) for x in starf]))
print data['done']
What I don't like about the answers so far is that they stick everything in one expression. You want to reduce the redundancy in your code, without doing too much at once.
If all of the items on the line are ints, then convert them all together, so you don't have to write int(...) each time:
starf = [int(i) for i in starf]
If only certain items are ints--maybe some are strings or floats--then you can convert just those:
for i in 0, 1, 2, 3, 4:
    starf[i] = int(starf[i])
Assigning in blocks is useful; if you have many items--you said you had 30--you can split it up:
done, rema, succ = starf[0:3]
fails, size = starf[3:5]
I might use the csv module with a separator of | (though that might be overkill if you're "sure" the format will always be super-simple, single-line, no-strings, etc, etc). Like your low-level string processing, the csv reader will give you strings, and you'll need to call int on each (with a list comprehension or a map call) to get integers. Other tips include using the with statement to open your file, to ensure it won't cause a "file descriptor leak" (not indispensable in current CPython version, but an excellent idea for portability and future-proofing).
But I question the need for 30 separate bare names to represent 30 related values. Why not, for example, make a collections.namedtuple type with appropriately-named fields, initialize an instance thereof, and then use qualified names for the fields, i.e., a nice namespace? Remember the last koan in the Zen of Python (import this at the interpreter prompt): "Namespaces are one honking great idea -- let's do more of those!"... bare names have their (limited;-) place, but representing dozens of related values is not one; rather, this situation "cries out" for the "let's do more of those" approach (i.e., add one appropriate namespace grouping the related fields -- a much better way to organize your data).
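For instance, a sketch with just the five fields from the question (a real Stats type would list all ~30 names; STATS_FILE is reused from the question):

import collections

Stats = collections.namedtuple('Stats', 'done rema succ fails size')  # extend with the remaining field names

with open(STATS_FILE) as f:
    stats = Stats._make(int(x) for x in f.read().strip().split('|')[:5])  # [:5] only because the field list is truncated here

print(stats.done)  # qualified names instead of 30 bare variables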
Using a Python dict is probably the most elegant choice.
If you put your keys in a tuple as such:
keys = ("done", "rema", "succ" ... )
somedict = dict(zip(keys, [int(v) for v in values]))  # values being the split list (starf in the question)
That would work. :-) Looks better than 30 lines too :-)
EDIT: I think there are dict comprehensions now, so that may look even better too! :-)
EDIT Part 2: Also, for the keys collection, you'd want to break that into multiple lines.
EDIT Again: fixed buggy part :)
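e.g., a sketch of both edits at once (labels truncated to the question's first five fields):

labels = ("done", "rema", "succ",
          "fails", "size")          # the full label tuple, broken across lines
stats = {k: int(v) for k, v in zip(labels, starf)}
print(stats["done"])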
Thanks for all the answers. So here's the summary -
Glenn's answer was to handle this issue in blocks, i.e. done, rema, succ = starf[0:3], etc.
Leoluk's approach was more short & sweet, taking advantage of Python's immensely powerful dict comprehensions.
Alex's answer was more design-oriented. Loved this approach. I know it should be done the way Alex suggested, but a lot of code refactoring would need to take place for that. Not a good time to do it now.
townsean - same as 2
I have taken up Leoluk's approach. I am not sure what the speed implication of this is; I have no idea whether list/dict comprehensions take a hit on execution speed. But it reduces the size of my code considerably for now. I'll optimize when the need comes :) Going by "Premature optimization is the root of all evil"...