Loop fails when building up a string from elements in a list - Python

I have a problem with a sequence generator. I have a file where each line contains one fragment (8 letters). I load it from the file into a list, where each element is one fragment. It is DNA, so it should go this way:
1. Take the first 8-letter element.
2. Find an element whose first 7 letters are the same as the last 7 letters of the first one.
3. Add the 8th letter of that second element to the sequence.
It should look like this:
ATTGCCAT
TTGCCATA
TGCAATAC
So the sequence is: ATTGCCATAC
Unfortunately it only adds one element. :( The first element is given (we know it); I take it to be the first one in the file (the first line).
Here is the code:
from os import sys
import random

def frag_get(seqfile):
    frags = []
    f_in = open(seqfile, "r")
    for i in f_in.readlines():
        frags.append(i.strip())
    f_in.close()
    return frags

def frag_list_shuffle(frags):
    random.shuffle(frags)
    return frags

def seq_build(first, frags):
    seq = first
    for f in frags:
        if seq[-7:] == f[:7]:
            seq += f[-1:]
    return seq

def errors():
    pass

if __name__ == "__main__":
    frags = frag_get(sys.argv[1])
    first = frags[0]
    frags.remove(first)
    frags = frag_list_shuffle(frags)
    seq = seq_build(first, frags)
    check(sys.argv[2], seq)
    spectrum(sys.argv[2], sys.argv[3])
I have left out the check and spectrum functions because they are just simple calculations (e.g. length comparisons), so I don't think they are what causes the problem.
I would be very thankful for any help!
Regards,
Mateusz

Because your fragments are shuffled, your algorithm needs to take that into account; currently, you're just looping through the fragments once, which is unlikely to include more than a few fragments if they're not in the right order. For example, say you have 5 fragments, which I'm going to refer to by their order in your sequence. Now the fragments are slightly out of order:
1 - 3 - 2 - 4 - 5
Your algorithm will start with 1, skip 3, then match on 2, adding a base at the end. Then it'll check against 4 and 5, and then finish, never reaching fragment 3.
You could easily fix this by restarting your loop each time you add a base; however, this will scale very badly for a large number of bases. Instead, I'd recommend loading your fragments into a trie, and then searching the trie for the next fragment each time you add a base, until you've added one base for each fragment or you can no longer find a matching fragment.
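To make that concrete, here is a minimal sketch of a lookup-based seq_build that uses a plain dict keyed on the 7-letter prefix as a simple stand-in for a trie (only first and frags come from the question's code; the rest is illustrative):
def seq_build(first, frags):
    # index every fragment by its first 7 letters so the next piece
    # can be looked up directly instead of rescanning the whole list
    by_prefix = {}
    for f in frags:
        by_prefix.setdefault(f[:7], []).append(f)
    seq = first
    used = 0
    while used < len(frags):
        candidates = by_prefix.get(seq[-7:])
        if not candidates:
            break  # no fragment continues the sequence
        nxt = candidates.pop()  # take any fragment whose first 7 letters match
        seq += nxt[-1]          # append its 8th letter
        used += 1
    return seq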

works for me:
>>> seq = "ATTGCCAT"
>>> frags = ["TTGCCATA", "TGCCATAC"]
>>> for f in frags:
...     if seq[-7:] == f[:7]:
...         seq += f[-1:]
...
>>> seq
'ATTGCCATAC'
You have a spelling error in your example, TGCAATAC should be TGCCATAC. But fixing that it works.

For fun and interest, I've rewritten the problem using OO. See what you think:
import collections
import sys
import random

usage = """
Usage:
    sequence fname expected
Where
    fname:    name of file containing fragments
    expected: result-string which should be obtained by chaining from first fragment.
"""

class Frag(str):
    MATCHLEN = 7
    def __new__(cls, s=''):
        return str.__new__(cls, s.strip())
    def head(self):
        return Frag(self[:Frag.MATCHLEN])
    def tail(self):
        return Frag(self[Frag.MATCHLEN:])
    def nexthead(self):
        return Frag(self[-Frag.MATCHLEN:])
    def check(self, s):
        return self.__eq__(s)
    def __add__(self, s):
        return Frag(str(self).__add__(s))

class Fraglist(list):
    @classmethod
    def fromFile(cls, fname):
        with open(fname, "r") as inf:
            lst = [Frag(ln) for ln in inf]
        return cls(lst)
    def shuffle(self):
        random.shuffle(self)

class Sequencer(object):
    def __init__(self, seq=None):
        super(Sequencer, self).__init__()
        self.sequences = collections.defaultdict(list)
        if seq is not None:
            for frag in seq:
                self.sequences[frag.head()].append(frag.tail())
    def build(self, frag):
        res = [frag]
        match = frag.nexthead()
        while match in self.sequences:
            next = random.choice(self.sequences[match])
            res.append(next)
            match = (match + next).nexthead()
        return Frag(''.join(res))

def main():
    if len(sys.argv) != 3:
        print usage
        sys.exit(-1)
    else:
        fname = sys.argv[1]
        expected = sys.argv[2]

    frags = Fraglist.fromFile(fname)
    frag1 = frags.pop(0)
    frags.shuffle()

    seq = Sequencer(frags)
    result = seq.build(frag1)

    if result.check(expected):
        print "Match!"
    else:
        print "No match"

if __name__=="__main__":
    main()


Is it possible to get the source code of a (possibly decorated) Python function body, including inline comments? [duplicate]

I am trying to figure out how to only get the source code of the body of the function.
Let's say I have:
def simple_function(b = 5):
    a = 5
    print("here")
    return a + b
I would want to get (up to indentation):
"""
a = 5
print("here")
return a + b
"""
While it's easy in the case above, I want it to be agnostic of decorators/function headers, etc. However, still include inline comments. So for example:
@decorator1
@decorator2
def simple_function(b: int = 5):
    """ Very sophisticated docs
    """
    a = 5
    # Comment on top
    print("here") # And in line
    return a + b
Would result in:
"""
a = 5
# Comment on top
print("here") # And in line
return a + b
"""
I was not able to find any utility for this and have been playing with inspect.getsourcelines for a few hours now, but with no luck.
Any help appreciated!
Why is it different from How can I get the source code of a Python function?
That question asks for the whole function's source code, which includes the decorators, docstring, def line, and the body itself. I'm interested in only the body of the function.
I wrote a simple regex that does the trick. I tried this script with classes and without, and it seemed to work fine either way. It opens whatever file you designate in the Main call at the bottom, rewrites the entire document with every function/method body turned into a docstring, and then saves it as whatever you designated as the second argument in the Main call.
It's not beautiful, and it could probably use more efficient regex statements, but it works. The regex finds everything from a decorator (if there is one) to the end of a function/method, capturing the indentation and the function/method body as groups. It then uses those groups in finditer to construct a docstring and place it before the entire chunk it found.
import re

FUNC_BODY = re.compile(r'^((([ \t]+)?#.+\n)+)?(?P<tabs>[\t ]+)?def([^\n]+)\n(?P<body>(^([\t ]+)?([^\n]+)\n)+)', re.M)
BLANK_LINES = re.compile(r'^[ \t]+$', re.M)

class Main(object):
    def __init__(self, file_in:str, file_out:str) -> None:
        # prime in/out strings
        in_txt = ''
        out_txt = ''
        # open requested file
        with open(file_in, 'r') as f:
            in_txt = f.read()
        # remove all lines that just have space characters on them
        # this stops FUNC_BODY from finding the entire file in one shot
        in_txt = BLANK_LINES.sub('', in_txt)
        last = 0  # to keep track of where we are in the file
        # process all matches
        for m in FUNC_BODY.finditer(in_txt):
            s, e = m.span()
            # make sure we catch anything that was between our last match and this one
            out_txt = f"{out_txt}{in_txt[last:s]}"
            last = e
            tabs = m.group('tabs') if not m.group('tabs') is None else ''
            # construct the docstring and inject it before the found function/method
            out_txt = f"{out_txt}{tabs}'''\n{m.group('body')}{tabs}'''\n{m.group()}"
        # save as requested file name
        with open(file_out, 'w') as f:
            f.write(out_txt)

if __name__ == '__main__':
    Main('test.py', 'test_docd.py')
EDIT:
Apparently, I "missed the entire point" so I wrote it again a different way. Now you can get the body while the code is running and decorators don't matter, at all. I left my other answer here because it is also a solution, just not a "real time" one.
import re, inspect

FUNC_BODY = re.compile('^(?P<tabs>[\t ]+)?def (?P<name>[a-zA-Z0-9_]+)([^\n]+)\n(?P<body>(^([\t ]+)?([^\n]+)\n)+)', re.M)

class Source(object):
    @staticmethod
    def investigate(focus:object, strfocus:str) -> str:
        with open(inspect.getsourcefile(focus), 'r') as f:
            for m in FUNC_BODY.finditer(f.read()):
                if m.group('name') == strfocus:
                    tabs = m.group('tabs') if not m.group('tabs') is None else ''
                    return f"{tabs}'''\n{m.group('body')}{tabs}'''"

def decorator(func):
    def inner():
        print("I'm decorated")
        func()
    return inner

@decorator
def test():
    a = 5
    b = 6
    return a+b

print(Source.investigate(test, 'test'))

Fixing Python Code

I'm trying to implement an iterator class named CharCounter. This class opens a text file and provides an iterator that returns the words from the file that contain a user-specified number of characters. It should output one word per line, which is not what it's doing: it's outputting the words as a list and then continuously outputting 'a'. How can I fix my code?
class CharCounter(object):
    def __init__(self, fileNm, strlen):
        self._fileNm = fileNm
        self._strlen = strlen
        fw = open(fileNm)
        text = fw.read()
        lines = text.split("\n")
        words = []
        pwords = []
        for each in lines:
            words += each.split(" ")
        chkEnd = ["'",'"',",",".",")","("]
        if words[-1] in chkEnd:
            words = words.rstrip()
        for each in words:
            if len(each) == strlen:
                pwords.append(each)
        print(pwords)
    def __iter__(self):
        return CharCounterIterator(self._fileNm)

class CharCounterIterator(object):
    def __init__(self,fileNm):
        self._fileNm = fileNm
        self._index = 0
    def __iter__(self):
        return self
    def next(self):
        try:
            ret = self._fileNm[self._index]
            return ret
        except IndexError:
            raise StopIteration

if __name__=="__main__":
    for word in CharCounter('agency.txt',11):
        print "%s" %word
Code posted on SO should not read a file unless the question is about reading files. The result cannot be duplicated and verified. (See MCVE.) Instead, define a text string as a stand-in for the file.
Your code prints the words of length n as a list because that is what you ask it to do with print(pwords). It repeatedly prints the first char of the filename because that is what you ask it to do in the __next__ method.
Your class __init__ does more than you describe. The attempt to strip punctuation from words does not do anything. The code below defines a class that turns a text into a list of stripped words (with duplicates). It also defines a parameterized generator method that filters the word list.
class Words:
    def __init__(self, text):
        self.words = words = []
        for line in text.split('\n'):
            for word in line.split():
                words.append(word.strip(""",'."?!()[]{}*$#"""))
    def iter_n(self, n):
        for word in self.words:
            if len(word) == n:
                yield word

# Test
text = """
It should output a word per line.
Which is not what's it's doing!
(It outputs the words as a [list] and then continuously outputs 'a'.)
How can I fix my #*!code?
"""

words = Words(text)
for word in words.iter_n(5):
    print(word)

# Prints
Which
doing
words

Iterator example from Dive Into Python 3

I'm learning Python as my first language from http://www.diveintopython3.net/. In Chapter 7, http://www.diveintopython3.net/iterators.html, there is an example of how to use an iterator.
import re

def build_match_and_apply_functions(pattern, search, replace):
    def matches_rule(word):
        return re.search(pattern, word)
    def apply_rule(word):
        return re.sub(search, replace, word)
    return [matches_rule, apply_rule]

class LazyRules:
    rules_filename = 'plural6-rules.txt'
    def __init__(self):
        self.pattern_file = open(self.rules_filename, encoding='utf-8')
        self.cache = []
    def __iter__(self):
        self.cache_index = 0
        return self
    def __next__(self):
        self.cache_index += 1
        if len(self.cache) >= self.cache_index:
            return self.cache[self.cache_index - 1]
        if self.pattern_file.closed:
            raise StopIteration
        line = self.pattern_file.readline()
        if not line:
            self.pattern_file.close()
            raise StopIteration
        pattern, search, replace = line.split(None, 3)
        funcs = build_match_and_apply_functions(
            pattern, search, replace)
        self.cache.append(funcs)
        return funcs

rules = LazyRules()

def plural(noun):
    for matches_rule, apply_rule in rules:
        if matches_rule(noun):
            return apply_rule(noun)

if __name__ == '__main__':
    import sys
    if sys.argv[1:]:
        print(plural(sys.argv[1]))
    else:
        print(__doc__)
My question is: how does the 'for matches_rule, apply_rule in rules:' loop in the plural(noun) function know when to exit after fulfilling the if condition? There is no StopIteration raised for that condition. I would expect the for loop to continue until rules.cache has been iterated over completely.
Thank you for the help!
The return statement ends the function at that point, returning a value to the caller. This can be relied upon in almost any situation (if you have a try..except..else..finally structure, even a return statement won't prevent the finally block from being executed).
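A tiny, self-contained example of the same behaviour (the names here are made up for illustration):
def first_even(numbers):
    for n in numbers:
        if n % 2 == 0:
            return n  # leaves the function, and therefore the loop, immediately
    return None

print(first_even(iter([1, 3, 4, 5, 6])))  # prints 4; 5 and 6 are never pulled from the iterator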

'MarkovGenerator' object has no attribute 'together'

I've run into a problem with this class and I don't know the reason; can anyone help me out?
The problem is in def together(); here is my code.
class MarkovGenerator(object):
    def __init__(self, n, max):
        self.n = n # order (length) of ngrams
        self.max = max # maximum number of elements to generate
        self.ngrams = dict() # ngrams as keys; next elements as values
        beginning = tuple(["That", "is"]) # beginning ngram of every line
        beginning2 = tuple(["on", "the"])
        self.beginnings = list()
        self.beginnings.append(beginning)
        self.beginnings.append(beginning2)
        self.sentences = list()

    def tokenize(self, text):
        return text.split(" ")

    def feed(self, text):
        tokens = self.tokenize(text)
        # discard this line if it's too short
        if len(tokens) < self.n:
            return
        # store the first ngram of this line
        #beginning = tuple(tokens[:self.n])
        #self.beginnings.append(beginning)
        for i in range(len(tokens) - self.n):
            gram = tuple(tokens[i:i+self.n])
            next = tokens[i+self.n] # get the element after the gram
            # if we've already seen this ngram, append; otherwise, set the
            # value for this key as a new list
            if gram in self.ngrams:
                self.ngrams[gram].append(next)
            else:
                self.ngrams[gram] = [next]

    # called from generate() to join together generated elements
    def concatenate(self, source):
        return " ".join(source)

    # generate a text from the information in self.ngrams
    def generate(self, i):
        from random import choice
        # get a random line beginning; convert to a list.
        #current = choice(self.beginnings)
        current = self.beginnings[i]
        output = list(current)
        for i in range(self.max):
            if current in self.ngrams:
                possible_next = self.ngrams[current]
                next = choice(possible_next)
                output.append(next)
                # get the last N entries of the output; we'll use this to look up
                # an ngram in the next iteration of the loop
                current = tuple(output[-self.n:])
            else:
                break
        output_str = self.concatenate(output)
        return output_str

        def together(self):
            return "lalala"

if __name__ == '__main__':
    import sys
    import random

    generator = MarkovGenerator(n=2, max=16)

    for line in open("us"):
        line = line.strip()
        generator.feed(line)

    for i in range(2):
        print generator.generate(i)
        print generator.together()
But I got the error saying:
Traceback (most recent call last):
File "markovoo2.py", line 112, in <module>
print generator.together()
AttributeError: 'MarkovGenerator' object has no attribute 'together'
Does anyone know the reason?
You have indented the def together() function definition too far; it is part of the def generate() function body.
Un-indent it to match the other functions in the class body.
It looks like your def together is indented too deeply; it is inside the generate method. Move it out one indentation level.
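A stripped-down sketch of the fix both answers describe (the method bodies here are placeholders, not the real code):
class MarkovGenerator(object):
    def generate(self, i):
        return "generated text %d" % i

    def together(self):  # aligned with generate(), so it is a method of the class
        return "lalala"

gen = MarkovGenerator()
print(gen.generate(0))
print(gen.together())  # no AttributeError once together() sits at class level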

Efficiently match multiple regexes in Python

Lexical analyzers are quite easy to write when you have regexes. Today I wanted to write a simple general analyzer in Python, and came up with:
import re
import sys

class Token(object):
    """ A simple Token structure.
        Contains the token type, value and position.
    """
    def __init__(self, type, val, pos):
        self.type = type
        self.val = val
        self.pos = pos

    def __str__(self):
        return '%s(%s) at %s' % (self.type, self.val, self.pos)

class LexerError(Exception):
    """ Lexer error exception.
        pos:
            Position in the input line where the error occurred.
    """
    def __init__(self, pos):
        self.pos = pos

class Lexer(object):
    """ A simple regex-based lexer/tokenizer.
        See below for an example of usage.
    """
    def __init__(self, rules, skip_whitespace=True):
        """ Create a lexer.
            rules:
                A list of rules. Each rule is a `regex, type`
                pair, where `regex` is the regular expression used
                to recognize the token and `type` is the type
                of the token to return when it's recognized.
            skip_whitespace:
                If True, whitespace (\s+) will be skipped and not
                reported by the lexer. Otherwise, you have to
                specify your rules for whitespace, or it will be
                flagged as an error.
        """
        self.rules = []
        for regex, type in rules:
            self.rules.append((re.compile(regex), type))
        self.skip_whitespace = skip_whitespace
        self.re_ws_skip = re.compile('\S')

    def input(self, buf):
        """ Initialize the lexer with a buffer as input.
        """
        self.buf = buf
        self.pos = 0

    def token(self):
        """ Return the next token (a Token object) found in the
            input buffer. None is returned if the end of the
            buffer was reached.
            In case of a lexing error (the current chunk of the
            buffer matches no rule), a LexerError is raised with
            the position of the error.
        """
        if self.pos >= len(self.buf):
            return None
        else:
            if self.skip_whitespace:
                m = self.re_ws_skip.search(self.buf[self.pos:])
                if m:
                    self.pos += m.start()
                else:
                    return None

            for token_regex, token_type in self.rules:
                m = token_regex.match(self.buf[self.pos:])
                if m:
                    value = self.buf[self.pos + m.start():self.pos + m.end()]
                    tok = Token(token_type, value, self.pos)
                    self.pos += m.end()
                    return tok

            # if we're here, no rule matched
            raise LexerError(self.pos)

    def tokens(self):
        """ Returns an iterator to the tokens found in the buffer.
        """
        while 1:
            tok = self.token()
            if tok is None: break
            yield tok

if __name__ == '__main__':
    rules = [
        ('\d+',          'NUMBER'),
        ('[a-zA-Z_]\w+', 'IDENTIFIER'),
        ('\+',           'PLUS'),
        ('\-',           'MINUS'),
        ('\*',           'MULTIPLY'),
        ('\/',           'DIVIDE'),
        ('\(',           'LP'),
        ('\)',           'RP'),
        ('=',            'EQUALS'),
    ]

    lx = Lexer(rules, skip_whitespace=True)
    lx.input('erw = _abc + 12*(R4-623902) ')

    try:
        for tok in lx.tokens():
            print tok
    except LexerError, err:
        print 'LexerError at position', err.pos
It works just fine, but I'm a bit worried that it's too inefficient. Are there any regex tricks that will allow me to write it in a more efficient / elegant way ?
Specifically, is there a way to avoid looping over all the regex rules linearly to find one that fits?
I suggest using the re.Scanner class. It's not documented in the standard library, but it's well worth using. Here's an example:
import re
scanner = re.Scanner([
    (r"-?[0-9]+\.[0-9]+([eE]-?[0-9]+)?", lambda scanner, token: float(token)),
    (r"-?[0-9]+", lambda scanner, token: int(token)),
    (r" +", lambda scanner, token: None),
])
>>> scanner.scan("0 -1 4.5 7.8e3")[0]
[0, -1, 4.5, 7800.0]
You can merge all your regexes into one using the "|" operator and let the regex library do the work of discerning between tokens. Some care should be taken to ensure the preference of tokens (for example to avoid matching a keyword as an identifier).
I found this in the Python documentation. It's simple and elegant.
import collections
import re

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(s):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',  r'\d+(\.\d*)?'), # Integer or decimal number
        ('ASSIGN',  r':='),          # Assignment operator
        ('END',     r';'),           # Statement terminator
        ('ID',      r'[A-Za-z]+'),   # Identifiers
        ('OP',      r'[+*\/\-]'),    # Arithmetic operators
        ('NEWLINE', r'\n'),          # Line endings
        ('SKIP',    r'[ \t]'),       # Skip over spaces and tabs
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    get_token = re.compile(tok_regex).match
    line = 1
    pos = line_start = 0
    mo = get_token(s)
    while mo is not None:
        typ = mo.lastgroup
        if typ == 'NEWLINE':
            line_start = pos
            line += 1
        elif typ != 'SKIP':
            val = mo.group(typ)
            if typ == 'ID' and val in keywords:
                typ = val
            yield Token(typ, val, line, mo.start()-line_start)
        pos = mo.end()
        mo = get_token(s, pos)
    if pos != len(s):
        raise RuntimeError('Unexpected character %r on line %d' % (s[pos], line))

statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''

for token in tokenize(statements):
    print(token)
The trick here is the line:
tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
Here (?P<ID>PATTERN) will mark the matched result with a name specified by ID.
re.match is anchored. You can give it a position argument:
pos = 0
end = len(text)
while pos < end:
    match = regexp.match(text, pos)
    # do something with your match
    pos = match.end()
Have a look at Pygments, which ships a large number of lexers for syntax-highlighting purposes with different implementations, most based on regular expressions.
It's possible that combining the token regexes will work, but you'd have to benchmark it. Something like:
x = re.compile('(?P<NUMBER>[0-9]+)|(?P<VAR>[a-z]+)')
a = x.match('9999').groupdict() # => {'VAR': None, 'NUMBER': '9999'}
if a:
    token = [a for a in a.items() if a[1] != None][0]
The filter is where you'll have to do some benchmarking...
Update: I tested this, and it seems as though if you combine all the tokens as stated and write a function like:
def find_token(lst):
    for tok in lst:
        if tok[1] != None: return tok
    raise Exception
You'll get roughly the same speed (maybe a teensy faster) for this. I believe the speedup must be in the number of calls to match, but the loop for token discrimination is still there, which of course kills it.
This isn't exactly a direct answer to your question, but you might want to look at ANTLR. According to this document the python code generation target should be up to date.
As to your regexes, there are really two ways to go about speeding things up if you're sticking to regexes. The first would be to order your regexes by the probability of finding them in a default text. You could do that by adding a simple profiler to the code that collects token counts for each token type and running the lexer on a body of work. The other solution would be to bucket-sort your regexes (since your key space, being a character, is relatively small) and then use an array or dictionary to select the regexes to try after a single discrimination on the first character.
However, I think that if you're going to go this route, you should really try something like ANTLR which will be easier to maintain, faster, and less likely to have bugs.
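For what it's worth, here is a minimal sketch of the first-character bucketing idea (the rule set and the bucket_lexer name are made up for illustration; only benchmarking would tell whether it beats the plain loop for your rules):
import re
from collections import defaultdict

rules = [
    (r'\d+', 'NUMBER'),
    (r'[a-zA-Z_]\w*', 'IDENTIFIER'),
    (r'\+', 'PLUS'),
    (r'=', 'EQUALS'),
]

# Map each possible first character to the (compiled regex, type) pairs
# that can start with it, so only a small subset is tried at each position.
buckets = defaultdict(list)
for pattern, typ in rules:
    compiled = re.compile(pattern)
    for code in range(128):
        if compiled.match(chr(code)):
            buckets[chr(code)].append((compiled, typ))

def bucket_lexer(text):
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for regex, typ in buckets.get(text[pos], ()):
            m = regex.match(text, pos)
            if m:
                yield typ, m.group()
                pos = m.end()
                break
        else:
            raise SyntaxError('no rule matches at position %d' % pos)

print(list(bucket_lexer('erw = _abc + 12')))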
