Python Class variable increments higher than the # of iterations - python

I'm a bit stumped on this one, and I'm sure it has to due with my lack of experience in Python OOP and how classes work, or how the new string formatting is functioning. Posting here so that others can reference if they have a similar issue.
The problem: I use the following Class as a generator in a loop to read in a file and parse it line by line. Each line of interest increments the self.reads_total variable by one, and if that line meets certain criteria, it increments the self.reads_mapping variable by 1. At the end (just before the StopIteration() call on the next() method, I output those two values to stdout. When I run this from the command line on a file with 52266374 lines, I get the following back:
Reads processed: 5220000029016 with both mates mapped out of 52263262 total reads
This output is generated just before the termination of the iteration by the following line of code:
print("{0} with both mates mapped out of {1} total reads\n".format(self.reads_mapping, self.reads_total))
The self.reads_total is outputting the correct # of iterations, but the self.reads_mapping is not.
Class object that is called by a simple for x in SamParser(infile) loop:
class SamParser:
"""This object takes as input a SAM file path and constructs an iterable that outputs
sequence information. Only one line will be held in memory at a time using this method.
"""
def __init__(self, filepath):
"""
constructor
#param filepath: filepath to the input raw SAM file.
"""
if os.path.exists(filepath): # if file is a file, read from the file
self.sam_file = str(filepath)
self.stdin = False
elif not sys.stdin.isatty(): # else read from standard in
self.stdin = True
else:
raise ValueError("Parameter filepath must be a SAM file")
self.current_line = None
self.reads_mapping = 0
self.reads_total = 0
# Allowable bitflags for SAM file -> reads with both mates mapping, regardless of other flags
self.true_flags = (99, 147, 83, 163, 67, 131, 115, 179, 81, 161, 97, 145, 65, 129, 113, 177)
def __iter__(self):
return self
def _iterate(self):
# Skip all leading whitespace
while True:
if self.stdin:
sam_line = sys.stdin.readline() # read from stdin
else:
sam_line = self.sam_file.readline() # read from file
if not sam_line:
return # End of file
if sam_line.startswith("#SQ"): # these lines contain refseq length information
temp = sam_line.split()
return temp[1][3:], temp[2][3:]
elif sam_line[0] != "#": # these lines are the actual reads
self.reads_total += 1
if self.reads_total % 100000 == 0: # update the counter on stdout every 100000 reads
sys.stdout.write("\rReads processed: {}".format(self.reads_total))
sys.stdout.flush()
temp = sam_line.split()
if int(temp[1]) in self.true_flags and temp[2] is not "*" and int(temp[3]) is not 0:
self.reads_mapping += 1
return temp[1], temp[2], temp[3], temp[9]
self.sam_file.close() # catch all in case this line is reached
assert False, "Should not reach this line"
def next(self):
if not self.stdin and type(self.sam_file) is str: # only open file here if sam_file is a str and not file
self.sam_file = open(self.sam_file, "r")
value = self._iterate()
if not value: # close file on EOF
if not self.stdin:
self.sam_file.close()
print("{0} with both mates mapped out of {1} total reads\n".format(self.reads_mapping, self.reads_total))
raise StopIteration()
else:
return value
The full script can be found here if you need more context: https://github.com/lakinsm/file-parsing/blob/master/amr_skewness.py

Simple string formatting mistake: the script is designed to update the line count in place on stdout, and I didn't include a newline at the beginning of the next line, so it was overwriting the last line output, which included the large read count from the previous total. The true output should be:
Reads processed: 52200000
29016 with both mates mapped out of 52263262 total reads
Plots generated: 310
In any case, the script is useful for SAM parsing if anyone stumbles upon this in the future. Don't make my same mistake.

Related

Pandoc Filter via Panflute not Working as Expected

Problem
For a Markdown document I want to filter out all sections whose header titles are not in the list to_keep. A section consists of a header and the body until the next section or the end of the document. For simplicity lets assume that the document only has level 1 headers.
When I make a simple case distinction on whether the current element has been preceeded by a header in to_keep and do either return None or return [] I get an error. That is, for pandoc --filter filter.py -o output.pdf input.md I get TypeError: panflute.dump needs input of type "panflute.Doc" but received one of type "list" (code, example file and complete error message at the end).
I use Python 3.7.4 and panflute 1.12.5 and pandoc 2.2.3.2.
Question
If make a more fine grained distinction on when to do return [], it works (function action_working). My question is, why is this more fine grained distinction neccesary? My solution seems to work, but it might well be accidental... How can I get this to work properly?
Files
error
Traceback (most recent call last):
File "filter.py", line 42, in <module>
main()
File "filter.py", line 39, in main
return run_filter(action_not_working, doc=doc)
File "C:\Users\ody_he\AppData\Local\Continuum\anaconda3\lib\site-packages\panflute\io.py", line 266, in run_filter
return run_filters([action], *args, **kwargs)
File "C:\Users\ody_he\AppData\Local\Continuum\anaconda3\lib\site-packages\panflute\io.py", line 253, in run_filters
dump(doc, output_stream=output_stream)
File "C:\Users\ody_he\AppData\Local\Continuum\anaconda3\lib\site-packages\panflute\io.py", line 132, in dump
raise TypeError(msg)
TypeError: panflute.dump needs input of type "panflute.Doc" but received one of type "list"
Error running filter filter.py:
Filter returned error status 1
input.md
# English
Some cool english text this is!
# Deutsch
Dies ist die deutsche Übersetzung!
# Sources
Some source.
# Priority
**Medium** *[Low | Medium | High]*
# Status
**Open for Discussion** *\[Draft | Open for Discussion | Final\]*
# Interested Persons (mailing list)
- Franz, Heinz, Karl
fiter.py
from panflute import *
to_keep = ['Deutsch', 'Status']
keep_current = False
def action_not_working(elem, doc):
'''For every element we check if it occurs in a section we wish to keep.
If it is, we keep it and return None (indicating to keep the element unchanged).
Otherwise we remove the element (return []).'''
global to_keep, keep_current
update_keep(elem)
if keep_current:
return None
else:
return []
def action_working(elem, doc):
global to_keep, keep_current
update_keep(elem)
if keep_current:
return None
else:
if isinstance(elem, Header):
return []
elif isinstance(elem, Para):
return []
elif isinstance(elem, BulletList):
return []
def update_keep(elem):
'''if the element is a header we update to_keep.'''
global to_keep, keep_current
if isinstance(elem, Header):
# Keep if the title of a section is in too keep
keep_current = stringify(elem) in to_keep
def main(doc=None):
return run_filter(action_not_working, doc=doc)
if __name__ == '__main__':
main()
I think what happens is that panflute call the action on all elements, including the Doc root element. If keep_current is False when walking the Doc element, it will be replaced by a list. This leads to the error message you are seeing, as panflute expectes the root node to always be there.
The updated filter only acts on Header, Para, and BulletList elements, so the Doc root node will be left untouched. You'll probably want to use something more generic like isinstance(elem, Block) instead.
An alternative approach could be to use panflute's load and dump elements directly: load the document into a Doc element, manually iterate over all blocks in args and remove all that are unwanted, then dump the resulting doc back into the output stream.
from panflute import *
to_keep = ['Deutsch', 'Status']
keep_current = False
doc = load()
for top_level_block in doc.args:
# do things, remove unwanted blocks
dump(doc)

the result printed to screen and writen to file are different

# coding=utf-8
def compare(arr1, arr2):
arr1 = arr1.strip()
arr2 = arr2.strip()
arr1 = arr1.split('\t')
arr2 = arr2.split('\t')
# print arr1[0], arr1[1]
arr1min = min(long(arr1[0]), long(arr1[1]))
arr1max = max(long(arr1[0]), long(arr1[1]))
arr2min = min(long(arr2[0]), long(arr2[1]))
arr2max = max(long(arr2[0]), long(arr2[1]))
# print arr1max, arr2max, arr1min, arr1min
if (arr1min < arr2min):
return -1
elif (arr1min > arr2min):
return 1
else:
if (arr1max < arr2max):
return -1
elif (arr1max > arr2max):
return 1
else:
return 0
f = open('er1000000new.txt')
fwrite = open('erzhesorted.txt', 'w')
lines = f.readlines()
lines.sort(compare)
for line in lines:
# fwrite.write(str(line))
print line
f.close()
fwrite.close()
the compare is custom sort function.
for example, when the result printed on screen, the result is
752555452697747457\t752551879448547328\t1468258301659\n
752563934733873152\t752561055289577472\t1468260508664\n
but the result printed on file is
6782762\t12\t1468248110665\n
2660899225\t12\t1468229395665\n
The two results are different, why?
The wb option when you open a file, makes python waiting for a byte object to write in that file.
Your (commented) line fwrite.write(str(line)) sends a string object.
Doesn't it produce the following error?
TypeError: a bytes-like object is required, not 'str'
Also it may come from your compare attribute in sort method.
What is it?
→ Removing the b option and removing the compare attribute produces the same outputs in term and file outputs.

Python ValueError: chr() arg not in range(256)

So I am learning python and redoing some old projects. This project involves taking in a dictionary and a message to be translated from the command line, and translating the message. (For example: "btw, hello how r u" would be translated to "by the way, hello how are you".
We are using a scanner supplied by the professor to read in tokens and strings. If necessary I can post it here too. Heres my error:
Nathans-Air-4:py1 Nathan$ python translate.py test.xlt test.msg
Traceback (most recent call last):
File "translate.py", line 26, in <module>
main()
File "translate.py", line 13, in main
dictionary,count = makeDictionary(commandDict)
File "/Users/Nathan/cs150/extra/py1/support.py", line 12, in makeDictionary
string = s.readstring()
File "/Users/Nathan/cs150/extra/py1/scanner.py", line 105, in readstring
return self._getString()
File "/Users/Nathan/cs150/extra/py1/scanner.py", line 251, in _getString
if (delimiter == chr(0x2018)):
ValueError: chr() arg not in range(256)
Heres my main translate.py file:
from support import *
from scanner import *
import sys
def main():
arguments = len(sys.argv)
if arguments != 3:
print'Need two arguments!\n'
exit(1)
commandDict = sys.argv[1]
commandMessage = sys.argv[2]
dictionary,count = makeDictionary(commandDict)
message,messageCount = makeMessage(commandMessage)
print(dictionary)
print(message)
i = 0
while count < messageCount:
translation = translate(message[i],dictionary,messageCount)
print(translation)
count = count + 1
i = i +1
main()
And here is my support.py file I am using...
from scanner import *
def makeDictionary(filename):
fp = open(filename,"r")
s = Scanner(filename)
lyst = []
token = s.readtoken()
count = 0
while (token != ""):
lyst.append(token)
string = s.readstring()
count = count+1
lyst.append(string)
token = s.readtoken()
return lyst,count
def translate(word,dictionary,count):
i = 0
while i != count:
if word == dictionary[i]:
return dictionary[i+1]
i = i+1
else:
return word
i = i+1
return 0
def makeMessage(filename):
fp = open(filename,"r")
s = Scanner(filename)
lyst2 = []
string = s.readtoken()
count = 0
while (string != ""):
lyst2.append(string)
string = s.readtoken()
count = count + 1
return lyst2,count
Does anyone know whats going on here? I've looked through several times and i dont know why readString is throwing this error... Its probably something stupid i missed
chr(0x2018) will work if you use Python 3.
You have code that's written for Python 3 but you run it with Python 2. In Python 2 chr will give you a one character string in the ASCII range. This is an 8-bit string, so the maximum parameter value for chris 255. In Python 3 you'll get a unicode character and unicode code points can go up to much higher values.
The issue is that the character you're converting using chr isn't within the range accepted (range(256)). The value 0x2018 in decimal is 8216.
Check out unichr, and also see chr.

Getting the error "TypeError: 'NoneType' object is not iterable" when trying to use whisper-merge

I'm trying to use whisper-merge to merge 2 wsp files. They have identical retention strategies, one just has older data than the other.
When I run whisper-merge oldfile.wsp newfile.wsp I get this error
Traceback (most recent call last):
File "/usr/local/src/whisper-0.9.12/bin/whisper-merge.py", line 32, in <module>
whisper.merge(path_from, path_to)
File "/usr/local/lib/python2.7/dist-packages/whisper.py", line 821, in merge
(timeInfo, values) = fetch(path_from, fromTime, untilTime)
TypeError: 'NoneType' object is not iterable
Any ideas?
Here's the meta data output for the 2 files:
"from_file" http://sprunge.us/dBHC
"to_file" http://sprunge.us/eIVG
Line 812 in whisper.py is broken for files that contain multiple archives.
https://github.com/graphite-project/whisper/blob/0.9.12/whisper.py#L812
fromTime = int(time.time()) - headerFrom['maxRetention']
To fix, immediately following line 813, assign fromTime based on the archive retention.
https://github.com/graphite-project/whisper/blob/0.9.12/whisper.py#L813
for archive in archives: # this line already exists
fromTime = int(time.time()) - archive['retention'] # add this line
Snippet from whisper.py
def fetch(path,fromTime,untilTime=None):
"""fetch(path,fromTime,untilTime=None)
path is a string
fromTime is an epoch time
untilTime is also an epoch time, but defaults to now.
Returns a tuple of (timeInfo, valueList)
where timeInfo is itself a tuple of (fromTime, untilTime, step)
Returns None if no data can be returned
"""
fh = open(path,'rb')
return file_fetch(fh, fromTime, untilTime)
Suggests that whisper.fetch() is returning None, which in turn, (along with the final line in the traceback) suggests that there is a problem with your path_from file.
Looking a little deeper, whisper.file_fetch() appears to have two places where it can return None (explicitly, at least):
def file_fetch(fh, fromTime, untilTime):
header = __readHeader(fh)
now = int( time.time() )
if untilTime is None:
untilTime = now
fromTime = int(fromTime)
untilTime = int(untilTime)
# Here we try and be flexible and return as much data as we can.
# If the range of data is from too far in the past or fully in the future, we
# return nothing
if (fromTime > untilTime):
raise InvalidTimeInterval("Invalid time interval: from time '%s' is after until time '%s'" % (fromTime, untilTime))
oldestTime = now - header['maxRetention']
# Range is in the future
if fromTime > now:
return None # <== Here
# Range is beyond retention
if untilTime < oldestTime:
return None # <== ...and here
...

Python multiprocessing "Bad file descriptor" error (not repeatable)

Apologies in advance, but I am unable to post a fully working example (too much overhead in this code to distill to a runnable snippet). I will post as much explanatory detail as I can, and please do let me know if anything critical seems missing.
Running Python 2.7.5 through IDLE
I am writing a program to compare two text files. Since the files can be large (~500MB) and each row comparison is independent, I would like to implement multiprocessing to speed up the comparison. This is working pretty well, but I am getting stuck on a pseudo-random Bad file descriptor error. I am new to multiprocessing, so I guess there is a technical problem with my implementation. Can anyone point me in the right direction?
Here is the code causing the trouble (specifically the pool.map):
# openfiles
csvReaderTest = csv.reader(open(testpath, 'r'))
csvReaderProd = csv.reader(open(prodpath, 'r'))
compwriter = csv.writer(open(outpath, 'wb'))
pool = Pool()
num_chunks = 3
chunksTest = itertools.groupby(csvReaderTest, keyfunc)
chunksProd = itertools.groupby(csvReaderProd, keyfunc)
while True:
# make a list of num_chunks chunks
groupsTest = [list(chunk) for key, chunk in itertools.islice(chunksTest, num_chunks)]
groupsProd = [list(chunk) for key, chunk in itertools.islice(chunksProd, num_chunks)]
# merge the two lists (pair off comparison rows)
groups_combined = zip(groupsTest,groupsProd)
if groups_combined:
# http://stackoverflow.com/questions/5442910/python-multiprocessing-pool-map-for-multiple-arguments
a_args = groups_combined # a list - set of combinations to be tested
second_arg = True
worker_result = pool.map(worker_mini_star, itertools.izip(itertools.repeat(second_arg),a_args))
Here is the full error output. (This error sometimes occurs, and other times the comparison runs to finish without problems):
Traceback (most recent call last):
File "H:/<PATH_SNIP>/python_csv_compare_multiprocessing_rev02_test2.py", line 407, in <module>
main(fileTest, fileProd, fileout, stringFields, checkFileLengths)
File "H:/<PATH_SNIP>/python_csv_compare_multiprocessing_rev02_test2.py", line 306, in main
worker_result = pool.map(worker_mini_star, itertools.izip(itertools.repeat(second_arg),a_args))
File "C:\Python27\lib\multiprocessing\pool.py", line 250, in map
return self.map_async(func, iterable, chunksize).get()
File "C:\Python27\lib\multiprocessing\pool.py", line 554, in get
raise self._value
IOError: [Errno 9] Bad file descriptor
If it helps, here are the functions called by pool.map:
def worker_mini(flag, chunk):
row_comp = []
for entry, entry2 in zip(chunk[0][0], chunk[1][0]):
if entry == entry2:
temp_comp = entry
else:
temp_comp = '%s|%s' % (entry, entry2)
row_comp.append(temp_comp)
return True, row_comp
#takes a single tuple argument and unpacks the tuple to multiple arguments
def worker_mini_star(flag_chunk):
"""Convert `f([1,2])` to `f(1,2)` call."""
return worker_mini(*flag_chunk)
def main():

Categories

Resources