I have a function (neural network model) which produces figures. I wish to test several parameters, methods and different inputs (meaning hundreds of runs of the function) from python using PBS on a standard cluster with Torque.
Note: I tried parallelpython, ipython and such and was never completely satisfied, since I want something simpler. The cluster is in a given configuration that I cannot change, and such a solution integrating python + qsub will certainly benefit the community.
To simplify things, I have a simple function such as:
import myModule
def model(input, a=1., N=100):
    do_lots_number_crunching(input, a, N)
    pylab.savefig('figure_' + input.name + '_' + str(a) + '_' + str(N) + '.png')
where input is an object representing the input, input.name is a string, and do_lots_number_crunching may last hours.
My question is: is there a correct way to transform something like a scan of parameters such as
for a in pylab.linspace(0., 1., 100):
    model(input, a)
into "something" that would launch a PBS script for every call to the model function?
#PBS -l ncpus=1
#PBS -l mem=i1000mb
#PBS -l cput=24:00:00
#PBS -V
cd /data/work/
python experiment_model.py
I was thinking of a function that would include the PBS template and call it from the python script, but could not yet figure it out (decorator?).
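For concreteness, here is a rough sketch of the kind of helper I have in mind (it assumes qsub is on the PATH and that experiment_model.py would take a as a command-line argument; this is just an illustration, not tested code):

import subprocess
import tempfile
import pylab

# Reuses the resource lines from the PBS script above; 'a' is filled in per job.
PBS_TEMPLATE = """#PBS -l ncpus=1
#PBS -l mem=i1000mb
#PBS -l cput=24:00:00
#PBS -V
cd /data/work/
python experiment_model.py {a}
"""

def qsub_model(a):
    """Write a one-off PBS script for this value of a and submit it with qsub."""
    with tempfile.NamedTemporaryFile('w', suffix='.pbs', delete=False) as f:
        f.write(PBS_TEMPLATE.format(a=a))
        script_name = f.name
    # qsub prints the job id on stdout
    return subprocess.check_output(['qsub', script_name]).decode().strip()

job_ids = [qsub_model(a) for a in pylab.linspace(0., 1., 100)]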
pbs_python[1] could work for this. If experiment_model.py takes 'a' as an argument, you could do
import pbs, os, pylab
server_name = pbs.pbs_default()
c = pbs.pbs_connect(server_name)
attropl = pbs.new_attropl(4)
attropl[0].name = pbs.ATTR_l
attropl[0].resource = 'ncpus'
attropl[0].value = '1'
attropl[1].name = pbs.ATTR_l
attropl[1].resource = 'mem'
attropl[1].value = 'i1000mb'
attropl[2].name = pbs.ATTR_l
attropl[2].resource = 'cput'
attropl[2].value = '24:00:00'
attropl[3].name = pbs.ATTR_V
script='''
cd /data/work/
python experiment_model.py %f
'''
jobs = []
for a in pylab.linspace(0., 1., 100):
    script_name = 'experiment_model.job' + str(a)
    with open(script_name, 'w') as scriptf:
        scriptf.write(script % a)
    job_id = pbs.pbs_submit(c, attropl, script_name, 'NULL', 'NULL')
    jobs.append(job_id)
    os.remove(script_name)
print jobs
[1]: pbs_python, https://oss.trac.surfsara.nl/pbs_python/wiki/TorqueUsage
You can do this easily using jug (which I developed for a similar setup).
You'd write, in a file (e.g., model.py):
from jug import TaskGenerator
import numpy as np
from matplotlib import pyplot

@TaskGenerator
def model(param1, param2):
    res = complex_computation(param1, param2)
    pyplot.coolgraph(res)

for param1 in np.linspace(0., 1., 100):
    for param2 in xrange(2000):
        model(param1, param2)
And that's it!
Now you can launch "jug jobs" on your queue: jug execute model.py and this will parallelise automatically. What happens is that each job will, in a loop, do something like:
while not all_done():
    for t in tasks_that_i_can_run():
        if t.lock_for_me(): t.run()
(It's actually more complicated than that, but you get the point).
It uses the filesystem for locking (if you're on an NFS system) or a redis server if you prefer. It can also handle dependencies between tasks.
This is not exactly what you asked for, but I believe it's a cleaner architecture to separate this from the job queueing system.
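To tie this back to the PBS setup in the question: one way to start several jug workers could be to submit the same small job script many times, roughly like the sketch below. It assumes jug is installed on the compute nodes and reuses the question's working directory (the jug_worker.pbs file name is arbitrary); jug's own locking then makes sure each task runs exactly once.

import subprocess

# A minimal PBS job body that runs one jug worker; all workers cooperate
# on the tasks defined in model.py.
JUG_JOB = """#PBS -l ncpus=1
#PBS -V
cd /data/work/
jug execute model.py
"""

with open('jug_worker.pbs', 'w') as f:
    f.write(JUG_JOB)

# Submit e.g. 20 identical workers.
for _ in range(20):
    subprocess.check_call(['qsub', 'jug_worker.pbs'])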
It looks like I'm a little late to the party, but I also had the same question of how to map embarrassingly parallel problems onto a cluster in python a few years ago and wrote my own solution. I recently uploaded it to github here: https://github.com/plediii/pbs_util
To write your program with pbs_util, I would first create a pbs_util.ini in the working directory containing
[PBSUTIL]
numnodes=1
numprocs=1
mem=i1000mb
walltime=24:00:00
Then a python script like this
import pbs_util.pbs_map as ppm
import pylab
import myModule

class ModelWorker(ppm.Worker):

    def __init__(self, input, N):
        self.input = input
        self.N = N

    def __call__(self, a):
        myModule.do_lots_number_crunching(self.input, a, self.N)
        pylab.savefig('figure_' + self.input.name + '_' + str(a) + '_' + str(self.N) + '.png')

# You need "main" protection like this since pbs_map will import this file on the compute nodes
if __name__ == "__main__":
    input, N = something, picklable

    # Use list to force the iterator
    list(ppm.pbs_map(ModelWorker, pylab.linspace(0., 1., 100),
                     startup_args=(input, N),
                     num_clients=100))
And that would do it.
I just started working with clusters and EP applications. My goal (I'm with the Library) is to learn enough to help other researchers on campus access HPC with EP applications, especially researchers outside of STEM. I'm still very new, but thought it might help this question to point out the use of GNU Parallel in a PBS script to launch basic Python scripts with varying arguments. In the .pbs file, there are two lines to point out:
module load gnu-parallel # this is required on my environment
parallel -j 4 --env PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
--workdir $NODE_LOCAL_DIR --transfer --return 'output.{#}' --clean \
`pwd`/simple.py '{#}' '{}' ::: $INPUT_DIR/input.*
# `-j 4` is the number of processors to use per node, will be cluster-specific
# {#} will substitute the process number into the string
# `pwd`/simple.py `{#}` `{}` this is the command that will be run multiple times
# ::: $INPUT_DIR/input.* all of the files in $INPUT_DIR/ that start with 'input.'
# will be substituted into the python call as the second(3rd) argument where the
# `{}` resides. These can be simple text files that you use in your 'simple.py'
# script to pass the parameter sets, filenames, etc.
As a newbie to EP supercomputing, even though I don't yet understand all the other options of parallel, this command allowed me to launch Python scripts in parallel with different parameters. This would work well if you can generate a slew of parameter files ahead of time that will parallelize your problem: for example, running simulations across a parameter space, or processing many files with the same code.
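For reference, a hypothetical simple.py matching the call above could look like the sketch below; it just reads its two arguments, the sequence number from {#} and one input file from {}, and writes an output.N file for --return to copy back. The details are placeholders.

#!/usr/bin/env python
"""Hypothetical simple.py invoked once per input file by the parallel command above."""
import sys

def main():
    run_number = sys.argv[1]       # value substituted for {#}: 1, 2, 3, ...
    input_path = sys.argv[2]       # value substituted for {}: one $INPUT_DIR/input.* file
    with open(input_path) as f:
        params = f.read().split()  # e.g. a whitespace-separated parameter set
    # ... do the real work with these parameters here ...
    with open('output.%s' % run_number, 'w') as out:   # matches --return 'output.{#}'
        out.write(' '.join(params) + '\n')

if __name__ == '__main__':
    main()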
Related
I am a beginner in the field of Python parallel computation. I want to parallelize one part of my code, which consists of for-loops. By default, my code runs a time loop over 3 years. On each day, my code calls a bash script run_offline.sh and runs it 8 times. Each time, the bash script is given different input data indexed by the loop id. Here is the main part of my Python code demo.py:
import os
import numpy as np
from dateutil.rrule import rrule, HOURLY, DAILY
import datetime
import subprocess
...
start_date = datetime.datetime(2017, 1, 1, 10, 0, 0)
end_date = datetime.datetime(2019, 12, 31, 10, 0, 0)
...
loop_id = 0
for date_assim in rrule(freq=HOURLY,
                        dtstart=start_date,
                        interval=time_delta,
                        until=end_date):
    RESDIR = './results/'
    TYP = 'experiment_1'
    END_ID = 8
    YYYYMMDDHH = date_assim.strftime('%Y%m%d%H')
    p1 = subprocess.Popen(['./run_offline.sh', str(END_ID), str(loop_id), str(YYYYMMDDHH), RESDIR, TYP])
    p1.wait()
    #%%
    # p1 creates 9 files from RESULTS_0.nc to RESULTS_8.nc
    # details please see the bash script attached below
    # following are codes computing based on RESULTS_${CID}.nc, CID from 0 to 8.
    # In total 9 files are generated by p1 and used later.
    loop_id += 1
And ./run_offline.sh runs an atmospheric model offline.exe 9 times, as follows:
#!/bin/bash
# Usage: ./run_offline.sh END_ID loop_id YYYYMMDDHH RESDIR TYP
END_ID=${1:-1}
loop_id=${2:-1}
YYYYMMDDHH=${3:-1}
RESDIR=${4:-1}
TYP=${5:-1}
END_ID=`echo $((END_ID))`
loop_id=`echo $((loop_id))`
CID=0
ln -sf PREP_0.nc PREP.nc # one of the input files required; it must be named PREP.nc
while [ $CID -le $END_ID ]; do
  cp -f ./OPTIONS.nam_${CID} ./OPTIONS.nam # one of the input files required by offline.exe
  # Each ./OPTIONS.nam_${CID} has a different index of a perturbation.
  # Say ./OPTIONS.nam_1 tells offline.exe to perturb the first variable in the atmospheric model,
  # ./OPTIONS.nam_2 perturbs the second variable...
  ./offline.exe
  cp RESULTS1.nc RESULTS_${CID}.OUT.nc # for the next part of the python code in demo.py
  mv RESULTS2.nc $RESDIR/$TYP/RESULTS2_${YYYYMMDDHH}.nc # store this file in my results dir
  CID=$((CID+1))
done
Now I find the loop over offline.exe super time-consuming. It takes around 10-20 s each time I call run_offline.sh (running ./offline.exe 9 times costs 10-20 s). In total it costs 15 s * 365 * 3 = 4.5 hours on average if I want to run my scripts for 3 years... So can I parallelize the loop over offline.exe? Say, assign the runs for different CID to different cores/subprocesses on the server. But note that the two input files OPTIONS.nam and PREP.nc must have exactly those names each time we run offline.exe... which means we cannot use OPTIONS.nam_x for loop x. So can I use dask or numba to help with this parallelization? Thanks!
If I understand your problem correctly, you run a bash script ~1000 times, and that script runs a black-box executable 8-9 times; this executable is the main bottleneck.
So can I parallize the loop of offline.exe?
This is hard to say because the executable is a black box. You need to check the input/output/temporary data required by the program. For example, if the program stores temporary files somewhere on the storage device, then calling it in parallel will result in a race condition. Besides, you can only call it in parallel if the computational parts are fully independent. A dataflow analysis is very useful for knowing whether you can parallelize an application (especially when it is composed of multiple programs).
Additionally, you need to check whether the program is already parallel or not. Running multiple parallel programs in parallel generally results in much slower execution due to the large number of threads to schedule, poor cache usage, bad synchronization patterns, etc.
In your case, I think the best option would be to parallelize the program run in the loop (i.e. offline.exe). Otherwise, if the program is sequential and its calls can run in parallel (see above), then you can run multiple processes using & in bash and wait for them at the end of the loop. Alternatively, you can use GNU parallel.
But one should note that two input files OPTIONS.nam and PREP.nc are forced to name as same names when we run offline.exe each time
This can be solved by calling the N programs from N distinct working directories in parallel. This is actually safer if the program creates temporary files in its working directory. You need to move/copy the files before the parallel execution and certainly after it.
If the files OPTIONS.nam and/or PREP.nc are modified by the program, then it means the computation is completely sequential and cannot be parallelized (I assume the computation of each day is dependent on the previous one, as this is a very common pattern in scientific simulations).
So can I use dask or numba to help this parallelization?
No. Dask and Numba are not meant to be used in this context. They are designed to operate on Numpy arrays in Python code. The part you want to parallelize is in bash, and the parallelized program is apparently not even written in Python.
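If you prefer to keep the orchestration in Python rather than bash, the "one working directory per run" idea can also be driven from demo.py with concurrent.futures and subprocess, roughly as below. This is only a sketch: it assumes offline.exe reads nothing but OPTIONS.nam and PREP.nc from its working directory and that the per-CID runs are independent; the file names mirror run_offline.sh.

import os
import shutil
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_one_cid(cid, resdir, typ, yyyymmddhh):
    """Run offline.exe for one CID in its own working directory."""
    workdir = 'run_%d' % cid
    os.makedirs(workdir, exist_ok=True)
    shutil.copy('OPTIONS.nam_%d' % cid, os.path.join(workdir, 'OPTIONS.nam'))
    shutil.copy('PREP_0.nc', os.path.join(workdir, 'PREP.nc'))
    subprocess.run([os.path.abspath('offline.exe')], cwd=workdir, check=True)
    shutil.copy(os.path.join(workdir, 'RESULTS1.nc'), 'RESULTS_%d.OUT.nc' % cid)
    # As in the original script, every CID writes to the same destination file,
    # so only the last finisher's RESULTS2.nc survives.
    shutil.move(os.path.join(workdir, 'RESULTS2.nc'),
                os.path.join(resdir, typ, 'RESULTS2_%s.nc' % yyyymmddhh))

def run_all_cids(end_id, resdir, typ, yyyymmddhh, max_workers=4):
    """Possible replacement for the p1 = subprocess.Popen(...) / p1.wait() step in demo.py."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_one_cid, cid, resdir, typ, yyyymmddhh)
                   for cid in range(end_id + 1)]
        for future in futures:
            future.result()   # re-raise any failure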
If you cannot make offline.exe use names other than OPTIONS.nam and RESULTS1.nc, you will need to make sure that the parallel instances do not overwrite each other.
One way to do this is to make a dir for each run:
#!/bin/bash
# Usage: ./run_offline.sh END_ID loop_id YYYYMMDDHH RESDIR TYP
END_ID=${1:-1}
loop_id=${2:-1}
YYYYMMDDHH=${3:-1}
RESDIR=${4:-1}
TYP=${5:-1}
END_ID=`echo $((END_ID))`
loop_id=`echo $((loop_id))`
doit() {
  mkdir $1
  cd $1
  ln -sf ../PREP_$1.nc PREP.nc
  cp -f ../OPTIONS.nam_$1 ./OPTIONS.nam # one of the input files required by offline.exe
  ../offline.exe
  cp RESULTS1.nc ../RESULTS_$1.OUT.nc # for next part of python codes in demo.py
  mv RESULTS2.nc $RESDIR/$TYP/RESULTS2_${YYYYMMDDHH}.nc # store this file in my results dir
}
export -f doit
export RESDIR YYYYMMDDHH TYP
seq 0 $END_ID | parallel doit
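Note that this keeps run_offline.sh's original usage line (./run_offline.sh END_ID loop_id YYYYMMDDHH RESDIR TYP), so demo.py can keep calling it unchanged; GNU parallel simply runs doit for CID 0 through END_ID concurrently, each in its own subdirectory, so the fixed OPTIONS.nam and PREP.nc names no longer collide.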
I'm working on https://github.com/JsBergbau/MiTemperature2 with a Raspberry Pi 3 Model B. It works properly in its own infinite loop, but I am not able to catch the output from the terminal. How can I access the output using Python?
Here is the part of printing:
measurement_time = datetime.datetime.fromtimestamp(measurement.timestamp)
print(measurement_time)
humidity=int.from_bytes(data[2:3],byteorder='little')
print("Temperature: " + str(temp))
print("Humidity: " + str(humidity))
voltage=int.from_bytes(data[3:5],byteorder='little') / 1000.
print("Battery voltage:",voltage,"V")
measurement.temperature = temp
measurement.humidity = humidity
measurement.voltage = voltage
measurement.sensorname = args.name
batteryLevel = min(int(round((voltage - 2.1),2) * 100), 100) #3.1 or above --> 100% 2.1 --> 0 %
measurement.battery = batteryLevel
print("Battery level:",batteryLevel)
Here is the script I run on terminal:
python3 LYWSD03MMC.py -d AA:BB:CC:DD:EE:FF
And here is the output:
2021-08-05 11:21:24
Temperature: 24.79
Humidity: 47
Battery voltage: 3.092 V
Battery level: 99
Those are the run command and sample output. Thanks for any help, best regards.
Change your code so it returns the information instead of just printing it. If you have code which looks like
something = some_function_call(123)
print(something)
other_one = different_function("some data here?").strip()
print(other_one)
probably refactor to
def get_something(number):
    return some_function_call(number)

def get_other_one():
    return different_function("some data here?").strip()

if __name__ == '__main__':
    print(get_something(123))
    print(get_other_one())
Now, you can create additional code which retrieves these values without printing them, and does whatever it wants with them. Put them on a web site? Upload them to a database? Rot13 encrypt them and send an email to Bill Gates? Your imagination is the limit.
How exactly you design your code is a broad topic where many books have been written, and more will be. A common arrangement is to make sure the useful parts are in modular functions which do one thing only (ideally without any side effects) so you can import this code and use it from other programs. (That's why the if __name__ part is useful. It makes sure code inside the block doesn't run when you import this file.)
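For example, another program could then reuse the same functions without triggering the prints; a small sketch (the module and file names here are hypothetical):

# reporter.py -- hypothetical consumer of the refactored module above
from sensor_readings import get_something, get_other_one   # i.e. whatever you named the refactored file

value = get_something(123)
other = get_other_one()
# Do whatever you want with the values instead of printing them,
# e.g. append them to a file:
with open('readings.txt', 'a') as f:
    f.write('%s %s\n' % (value, other))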
Have you had a closer look at the code? There is a callback option, which is the easiest way to get values from this script. Or is this question more academic, about how to capture Python output in general?
If not, this should help you:
Documentation where callback is described:
https://github.com/JsBergbau/MiTemperature2#callback-for-processing-the-data
Accessing the single values:
In sendToInflux.sh (https://github.com/JsBergbau/MiTemperature2/blob/master/sendToInflux.sh) there is an example in which the values, like temperature and so on, are passed as arguments.
Or, when using sendToFile.sh, it gives output line by line:
sensorname,temperature,humidity,voltage,humidityCalibrated,timestamp MySensor 20.61 54 2.944 49 1582120122
That data should be easy to process with Python or awk.
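A small sketch of reading such a file from Python; the file name and the exact layout of each line are assumptions based on the sample above.

def parse_line(line):
    """Parse one output line; the last six tokens are the values."""
    name, temperature, humidity, voltage, humidity_cal, timestamp = line.split()[-6:]
    return {
        'sensorname': name,
        'temperature': float(temperature),
        'humidity': int(humidity),
        'voltage': float(voltage),
        'humidityCalibrated': int(humidity_cal),
        'timestamp': int(timestamp),
    }

with open('sensor_output.txt') as f:     # whatever file you point sendToFile.sh at
    for line in f:
        if line.strip():
            print(parse_line(line))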
Append this to your command line:
2>&1 | tee result.txt
It saves the command-line output (stderr included) to result.txt while still printing it to the terminal.
If you are running a command from Python, you can use subprocess.check_output to get the output from the terminal. This doesn't work if the called script runs forever.
Like this:
output = subprocess.check_output([sys.executable, 'LYWSD03MMC.py', '-d', 'AA:BB:CC:DD:EE:FF']).decode()
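If the script never exits, as with the sensor's infinite loop, one option is to read its output line by line with subprocess.Popen instead; a sketch:

import subprocess
import sys

# Run the script unbuffered (-u) and stream its stdout line by line.
proc = subprocess.Popen(
    [sys.executable, '-u', 'LYWSD03MMC.py', '-d', 'AA:BB:CC:DD:EE:FF'],
    stdout=subprocess.PIPE,
    universal_newlines=True,   # text mode: we get str, not bytes
)
for line in proc.stdout:
    line = line.strip()
    if line.startswith('Temperature:'):
        temperature = float(line.split(':', 1)[1])
        print('got temperature', temperature)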
Recently I have been trying to make a makeshift "disk space" reader. I made a library that stores values in a list (the "disk"), and when I subprocess a new script to write to the "disk" to see whether the values change on the display, nothing happens. I realized that any time you import a module, the module is sort of cloned into only that script.
I want to be able to have scripts import the same module, so that if one script changes a value, another script can see that value.
Here is my code for the "disk" system
import time

ram = []
space = 256000
lastspace = 0

for i in range(0, space + 1):
    ram.append('')

def read(location):
    try:
        if ram[int(location)] == '':
            return "ERR_NO_VALUE"
        else:
            return ram[int(location)]
    except:
        return "ERR_OUT_OF_RANGE"

def write(location, value):
    try:
        ram[int(location)] = value
    except:
        return "ERR_OUT_OF_RANGE"

def getcontents():
    contents = []
    for i in range(0, 256001):
        contents.append([str(i) + '- ', ram[i]])
    return contents

def getrawcontents():
    contents = []
    for i in range(0, 256001):
        contents.append(ram[i])
    return contents

def erasechunk(beg, end):
    try:
        for i in range(int(beg), int(end) + 1):
            ram[i] = ''
    except:
        return "ERR_OUT_OF_RANGE"

def erase(location):
    ram[int(location)] = ''

def reset():
    ram = []
    times = space/51200
    tc = 0
    for i in range(0, round(times)):
        for x in range(0, 51201):
            ram.append('')
            tc += 1
            print("Byte " + str(tc) + " of " + " Bytes")
    for a in range(0, 100):
        print('\a', end='')
    return [len(ram), ' bytes']

def wipe():
    for i in range(0, 256001):
        ram[i] = ''
    return "WIPED"

def getspace():
    x = 0
    for i in range(0, len(ram)):
        if ram[i] != "":
            x += 1
    return [x, 256000]
The shortest answer to your question, which I'm understanding as "if I import the same function into two (or more) Python namespaces, can they interact with each other?", is no. What actually happens when you import a module is that Python uses the source script to 'build' those functions in the namespace you're importing them to; there is no sense of permanence in "where the module came from" since that original module isn't actually running in a Python process anywhere! When you import those functions into multiple scripts, it's just going to create those pseudo-global variables (in your case ram) with the function you're importing.
Python import docs: https://docs.python.org/3/reference/import.html
The whole page on Python's data model, including what __globals__ means for functions and modules: https://docs.python.org/3/reference/datamodel.html
Explanation:
To go into a bit more depth, when you import any of the functions from this script (let's assume it's called 'disk.py'), you'll get an object in that function's __globals__ dict called ram, which will indeed work as you expect for these functions in your current namespace:
from disk import read,write
write(13,'thing')
print(read(13)) #prints 'thing'
We might assume, since these functions are accurately accessing our ram object, that the ram object is being modified somehow in the namespace of the original script, which could then be accessed by a different script (a different Python process). Looking at the namespace of our current script using dir() might support that notion, since we only see read and write, and not ram. But the secret is that ram is hidden in those functions' __globals__ dict (mentioned above), which is how the functions are interacting with ram:
from disk import read,write
print(type(write.__globals__['ram'])) #<class 'list'>
print(write.__globals__['ram'] is read.__globals__['ram']) #True
write(13,'thing')
print(read(13)) #'thing'
print(read.__globals__['ram'][13]) #'thing'
As you can see, ram actually is a variable defined in the namespace of our current Python process, hidden in the functions' __globals__ dict, which is actually the exact same dictionary for any function imported from the same module; read.__globals__ is write.__globals__ evaluates to True (even if you don't import them at the same time!).
So, to wrap it all up, ram is contained in the __globals__ dict for the disk module, which is created separately in the namespace of each process you import into:
Python interpreter #1:
from disk import read,write
print(id(read.__globals__),id(write.__globals__)) #139775502955080 139775502955080
Python interpreter #2:
from disk import read,write
print(id(read.__globals__),id(write.__globals__)) #139797009773128 139797009773128
Solution hint:
There are many approaches on how to do this practically that are beyond the scope of this answer, but I will suggest that pickle is the standard way to send objects between Python interpreters using files, and has a really standard interface. You can just write, read, etc your ram object using a pickle file. To write:
import pickle
with open('./my_ram_file.pkl', 'wb') as ram_f:
    pickle.dump(ram, ram_f)
To read:
import pickle
with open('./my_ram_file.pkl', 'rb') as ram_f:
    ram = pickle.load(ram_f)
The initial question
I'm writing a bioinformatics script in python (3.5) that parses a large (sorted and indexed) bam file representing sequencing reads aligned on a genome, associates genomic information ("annotations") to these reads, and counts the types of annotations encountered. I'm measuring the speed at which my script processes aligned reads (over batches of 1000 reads), and I obtain the following speed variations:
What could explain this pattern?
My intuition would make me bet on some data structure that progressively gets slower as it gets denser, but which is expanded from time to time.
It doesn't seem that memory usage is significant, though (after almost 2 hours running, my script still uses only 0.1% of the memory of my computer, according to htop).
How my code works (see at the end for the actual code)
I'm using the pysam module to do the bam file parsing. The AlignmentFile.fetch method gives me an iterator providing information about successive aligned reads in the form of AlignedSegment objects.
I associate annotations to the reads based on their alignment coordinates and an annotation file in gtf format (compressed with bgzip and indexed with tabix). I use the TabixFile.fetch method (still from pysam) to get these annotations, I filter them and yield a summary of them in the form of a frozenset of strings (process_annotations, not shown below, returns such a frozenset), in a generator function that internally loops over the AlignedSegment iterator.
I feed the generated frozensets to a Counter object. Could the counter be responsible for the observed speed behaviour?
How can I find out how to avoid this regular slowing?
Additional tests
Following suggestions in the comments, I profiled my whole analysis using cProfile and found that most of the running time was spent accessing annotation data (pysam/ctabix.pyx:579(__cnext__), see the call graph later), which, if I understand correctly, is some Cython code interfacing with the samtools C libraries. It seemed the cause of the observed slowing would be difficult to pin down.
In an attempt to speed up my script, I tried another solution based on the pybedtools python interface with bedtools, which can also retrieve annotations from gtf files (https://daler.github.io/pybedtools/index.html).
Speed
The speed improvement is quite important. Here are the actual command and timing results (the two were actually run in parallel):
$ time python3 -m cProfile -o tests/total_pybedtools.prof ~/src/bioinfo_utils/small_RNA_seq_annotate.py -b results/bowtie2/mapped_C_elegans/WT_1_21-26_on_C_elegans_sorted.bam -g annotations/all_annotations.gtf -a "pybedtools" -l total_pybedtools.log > total_pybedtools.out
real 5m48.474s
user 5m48.204s
sys 0m1.336s
$ time python3 -m cProfile -o tests/total_tabix.prof ~/src/bioinfo_utils/small_RNA_seq_annotate.py -b results/bowtie2/mapped_C_elegans/WT_1_21-26_on_C_elegans_sorted.bam -g annotations/all_annotations.gtf.gz -a "tabix" -l total_tabix.log > total_tabix.out
real 195m40.990s
user 194m54.356s
sys 0m47.696s
(To be noted: the annotation results are slightly different between the two approaches. Maybe I should check how I handle coordinates.)
The speed profile doesn't have the previously observed long-period drops:
My speed problem is solved, but I'm still interested in insight as to why the tabix-based approach has these speed drops. I added the "bioinformatics" and "samtools" tags for this reason.
Call graphs
For the record, I generated call graphs using gprof2dot on the profiling results:
$ gprof2dot -f pstats tests/total_pybedtools.prof \
| dot -Tpng -o tests/total_pybedtools_callgraph.png
$ gprof2dot -f pstats tests/total_tabix.prof \
| dot -Tpng -o tests/total_tabix_callgraph.png
Here is the call graph for the tabix-based approach:
For the pybedtools-based approach:
The code
Here is the main part of my current code:
@contextmanager
def annotation_context(annot_file, getter_type):
    """Yields a function to get annotations for an AlignedSegment."""
    if getter_type == "tabix":
        gtf_parser = pysam.ctabix.asGTF()
        gtf_file = pysam.TabixFile(annot_file, mode="r")
        fetch_annotations = gtf_file.fetch
        def get_annotations(ali):
            """Generates an annotation getter for *ali*."""
            return fetch_annotations(*ALI2POS_INFO(ali), parser=gtf_parser)
    elif getter_type == "pybedtools":
        gtf_file = open(annot_file, "r")
        # Does not work because somehow gets "consumed" after first usage
        #fetch_annotations = BedTool(gtf_file).all_hits
        # Much too slow
        #fetch_annotations = BedTool(gtf_file.readlines()).all_hits
        # https://daler.github.io/pybedtools/topical-low-level-ops.html
        fetch_annotations = BedTool(gtf_file).as_intervalfile().all_hits
        def get_annotations(ali):
            """Generates an annotation list for *ali*."""
            return fetch_annotations(Interval(*ALI2POS_INFO(ali)))
    else:
        raise NotImplementedError("%s not available" % getter_type)
    yield get_annotations
    gtf_file.close()
def main():
    """Main function of the program."""
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument(
        "-b", "--bamfile",
        required=True,
        help="Sorted and indexed bam file containing the mapped reads."
        "A given read is expected to be aligned at only one location.")
    parser.add_argument(
        "-g", "--gtf",
        required=True,
        help="A sorted, bgzip-compressed gtf file."
        "A corresponding .tbi tabix index should exist.")
    parser.add_argument(
        "-a", "--annotation_getter",
        choices=["tabix", "pybedtools"],
        default="tabix",
        help="Method to use to access annotations from the gtf file.")
    parser.add_argument(
        "-l", "--logfile",
        help="File in which to write logs.")
    args = parser.parse_args()
    if not args.logfile:
        logfilename = "%s.log" % args.annotation_getter
    else:
        logfilename = args.logfile
    logging.basicConfig(
        filename=logfilename,
        level=logging.DEBUG)
    INFO = logging.info
    DEBUG = logging.debug
    WARNING = logging.warning
    process_annotations = make_annotation_processor(args.annotation_getter)
    with annotation_context(args.gtf, args.annotation_getter) as get_annotations:
        def generate_annotations(bamfile):
            """Generates annotations for the alignments in *bamfile*."""
            last_t = perf_counter()
            for i, ali in enumerate(bamfile.fetch(), start=1):
                if not i % 1000:
                    now = perf_counter()
                    INFO("%d alignments processed (%.0f alignments / s)" % (
                        i,
                        1000.0 / (now - last_t)))
                    #if not i % 50000:
                    #    gc.collect()
                    last_t = perf_counter()
                yield process_annotations(get_annotations(ali), ali)
        with pysam.AlignmentFile(args.bamfile, "rb") as bamfile:
            annot_stats = Counter(generate_annotations(bamfile))
    print(*reversed(annot_stats.most_common()), sep="\n")
    return 0
(I used a contextmanager and other higher-order functions (make_annotation_processor and functions this one calls) to make it easier to have various annotation retrieving approaches in the same script.)
I am developing a program in Python, and one element tells the user how much bandwidth they have used since the program was opened (not just within the program, but regular web browsing while the program has been open). The output should be displayed in GTK.
Is there anything in existence? If not, can you point me in the right direction? It seems like I would have to edit an existing proxy script like pythonproxy, but I can't see how I would use it.
Thanks,
For my task I wrote a very simple solution using psutil:
import time
import psutil

def main():
    old_value = 0
    while True:
        new_value = psutil.net_io_counters().bytes_sent + psutil.net_io_counters().bytes_recv
        if old_value:
            send_stat(new_value - old_value)
        old_value = new_value
        time.sleep(1)

def convert_to_gbit(value):
    return value/1024./1024./1024.*8

def send_stat(value):
    print("%0.3f" % convert_to_gbit(value))

main()
import time

def get_bytes(t, iface='wlan0'):
    with open('/sys/class/net/' + iface + '/statistics/' + t + '_bytes', 'r') as f:
        data = f.read()
    return int(data)

while True:
    tx1 = get_bytes('tx')
    rx1 = get_bytes('rx')
    time.sleep(1)
    tx2 = get_bytes('tx')
    rx2 = get_bytes('rx')
    tx_speed = round((tx2 - tx1)/1000000.0, 4)
    rx_speed = round((rx2 - rx1)/1000000.0, 4)
    print("TX: %fMbps RX: %fMbps" % (tx_speed, rx_speed))
This should work.
Well, I'm not quite sure if there is something in existence (written in Python), but you may want to have a look at the following.
Bandwidth Monitoring (Not really an active project but may give you an idea).
Munin Monitoring (A Perl-based network monitoring project)
ntop (written in C/C++, based on libpcap)
Also, just to give you pointers if you are looking to do something on your own, one way could be to count and store packets by reading /proc/net/dev.
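For that last pointer, a minimal sketch of reading the per-interface byte counters from /proc/net/dev (it is world-readable, so no sudo is needed, and it exposes cumulative byte and packet counters rather than individual packets):

import time

def read_counters(path='/proc/net/dev'):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev (Linux only)."""
    counters = {}
    with open(path) as f:
        for line in f.readlines()[2:]:      # skip the two header lines
            iface, data = line.split(':', 1)
            fields = data.split()
            counters[iface.strip()] = (int(fields[0]), int(fields[8]))
    return counters

before = read_counters()
time.sleep(1)
after = read_counters()
for iface in after:
    rx = after[iface][0] - before[iface][0]
    tx = after[iface][1] - before[iface][1]
    print('%s: rx %d B/s, tx %d B/s' % (iface, rx, tx))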
A proxy would only cover network applications that were configured to use it. You could set, e.g. a web browser to use a proxy, but what happens when your proxy exits?
I think the best thing to do is to hook in lower down the stack. There is a program that does this already, iftop. http://en.wikipedia.org/wiki/Iftop
You could start by reading the source code of iftop, perhaps wrap that into a Python C extension. Or rewrite iftop to log data to disk and read it from Python.
Would something like WireShark (https://wiki.wireshark.org/FrontPage) do the trick? I am tackling a similar problem now, and am inclined to use pyshark, a WireShark/TShark wrapper, for the task. That way you can get capture file info readily.