How to read and change BAM files from a Python script?

I'm planning to use a Python script to change the headers of several BAM (Binary Alignment Map) files. Right now I am just testing the output for one BAM file, but every time I check the output, what is printed to stdout is not human-readable. How can I see the output of my script? Should I run samtools view on the BAM file from my script? This is my code:
#!/usr/bin/env python
import os
import subprocess
if __name__=='__main__':
    for file in os.listdir(os.getcwd()):
        if file == "SRR4209928.bam":
            with open("SRR4209928.bam", "r") as input:
                content = input.readlines()
                for line in content:
                    print(line)

BAM is the compressed binary counterpart of the text-based SAM format, so you need something that knows how to handle the compressed data before you can extract anything meaningful from it. Unfortunately, you can't just open() the file and call readlines() on it.
If you are going to write such a module yourself, you will need to read the Sequence Alignment/Map Format Specification.
Fortunately, someone has already done that and created a Python module: check out pysam. It will surely make your life easier.
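A minimal sketch of what that could look like, assuming pysam is installed (pip install pysam) and that the header change is simply adding a comment (@CO) line. The function names add_comment and rewrite_header and the paths passed to them are illustrative and not part of pysam.

```python
import copy


def add_comment(header_dict, comment):
    """Return a copy of a SAM header dict with one extra @CO (comment) line."""
    new_header = copy.deepcopy(header_dict)
    new_header.setdefault("CO", []).append(comment)
    return new_header


def rewrite_header(in_path, out_path, comment):
    """Copy a BAM file to out_path, adding a comment line to its header."""
    import pysam  # third-party module; imported here so the helper above works without it

    with pysam.AlignmentFile(in_path, "rb") as bam_in:
        header = add_comment(bam_in.header.to_dict(), comment)
        with pysam.AlignmentFile(out_path, "wb", header=header) as bam_out:
            for read in bam_in:
                bam_out.write(read)


# Example call (hypothetical output name):
# rewrite_header("SRR4209928.bam", "SRR4209928.edited.bam", "header edited by script")
```

To look at the result as text, samtools view -h on the output file prints the header followed by the alignments in SAM format.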
I hope it helps.

Related

Python: read data from a variable and write it to a file (without printing) to use later as backup data

I am new to Python and programming, and I am starting to try a few things for my project.
My problem is as follows:
p=subprocess.Popen([Some command which gives me output],stdout=subprocess.PIPE,stderr=subprocess.STDOUT)
p.wait()
content=p.stdout.readlines()
for line in content:
    filedata=line.lstrip().rstrip()
    -----> I want to save this filedata output to a file.
If I use print filedata it works and gives me exactly what I want, but I do not want to print it; I want to keep the data to use later.
Thanks in advance.

You can do that in the following two ways.
Option one uses the more traditional way of file handling. I have used the with statement, so you don't have to worry about closing the file.
Option two makes use of the pathlib module, which is new in Python 3.4; the Path.write_text() method used here was added in 3.5 (I recommend using this option).
somefile.txt stands for the full path of the file in the file system. I've also included documentation links and I highly recommend going through those.
OPTION ONE
import subprocess
command = ["your-command"]  # placeholder: the command which gives you the output
p = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, universal_newlines=True)
# Open the file once and append every stripped line as it arrives
with open('somefile.txt', 'a') as file:
    for line in p.stdout:
        file.write(line.strip() + '\n')
p.wait()
Documentation for The with Statement
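OPTION TWO - For Python 3.5 or above
import pathlib
import subprocess
command = ["your-command"]  # placeholder: the command which gives you the output
p = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, universal_newlines=True)
output, _ = p.communicate()
lines = [line.strip() for line in output.splitlines()]
# write_text() replaces the file's contents, so write all lines in one call
pathlib.Path('somefile.txt').write_text('\n'.join(lines) + '\n')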
Documentation on Pathlib module

How to execute a python file using txt file as input (to parse data)

I tried looking inside stackoverflow and other sources, but could not find the solution.
I am trying to run/execute a Python script (that parses the data) using a text file as input.
How do I go about doing it?
Thanks in advance.

These basics can be found using google :)
http://pythoncentral.io/execute-python-script-file-shell/
http://www.python-course.eu/python3_execute_script.php
Since you are new to Python make sure that you read Python For Beginners
Sample code Read.py:
import sys
# sys.argv[1] is the first command-line argument: the path of the text file
with open(sys.argv[1], 'r') as f:
    contents = f.read()
print(contents)
To execute this program in Windows:
C:\Users\Username\Desktop>python Read.py sample.txt

You can also save the input, line by line, in a file and then feed that file to the script as its standard input (STDIN) using the shell's input redirection operator <:
python source.py < input.txt
Note: To use input or source files that are in another directory, use the complete file path instead of just the file name.
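The two approaches in this thread (a file name as argument, or the file redirected to standard input) can be combined. The sketch below is one way to do that; read_input is a made-up helper name, not a standard function.

```python
import sys


def read_input(argv, stdin):
    """Return the text of the file named in argv[1], or all of stdin if no file is given."""
    if len(argv) > 1:
        with open(argv[1], "r") as handle:
            return handle.read()
    return stdin.read()


# In a script you would call it with the real arguments and stream:
# contents = read_input(sys.argv, sys.stdin)
```

With this in Read.py, both python Read.py sample.txt and python Read.py < sample.txt would hand the same text to the parsing code.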

How to use a python program stored in a text file?

I'm making a troubleshooting program in which I need to take a python program which is stored in a text file, but I can't use the 'import' module. To clarify this, there would be a python program stored as a '.txt' file, and in the main program I would take this text file and be able to use it as a subprogram. I've tried doing this, but I have had no clue of how to go about it, especially since I do not have much experience of Python.
Below is roughly the program. I don't know how to format it either, but here goes:
phonechoice = input("What type of phone do you have?")
if 'iphone' in phonechoice:
    #here I would load a text file which contains the program for the iphone
    #which asks them what problem they have with their phone and gives a solution
I'm wondering how I can do this. I thought how I could do this and maybe I could 'copy and paste' the program, line by line, into a definition, which I could then use. Would this work, and if it doesn't then in what other way could I do it?

Rename the text file to a python file, i.e. change the extension to ".py". This does not change the fact that it is a text file, just like renaming a picture.jpg file to picture.txt does not change the fact that it's an image file.
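If you have some wacky requirement to import a module saved in a file with a .txt extension, you cannot use an import statement. It is still possible to import it, though. On Python versions that ship the imp module (deprecated since 3.4, removed in 3.12), it looks like this: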
import imp
my_module = imp.load_source('my_module', 'example.txt')
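On Python 3.12 and later, where imp no longer exists, the same effect can be had with importlib. This is a sketch of the usual replacement; load_source_file is a hypothetical helper name, and 'example.txt' is the file from the snippet above.

```python
import importlib.machinery
import importlib.util


def load_source_file(module_name, path):
    """Load Python source from any file path (even one ending in .txt) as a module."""
    loader = importlib.machinery.SourceFileLoader(module_name, path)
    spec = importlib.util.spec_from_loader(module_name, loader)
    module = importlib.util.module_from_spec(spec)
    loader.exec_module(module)  # runs the file's code inside the new module
    return module


# my_module = load_source_file('my_module', 'example.txt')
```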

I am a bit reluctant to answer a "homework" type question, but I will give you some pointers on what you need to do. If I have a text file with this in it:
def main():
    print("Hello")
main()
I could execute the code with the exec function like this:
with open("filename.txt") as file: #filename should be the name of the file
    data = file.read()
exec(data) #this executes the code
The output would be as expected:
Hello
Hopefully this will shed some light on your problem!
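If the main program needs to call a function from the text file rather than just run it top to bottom, exec can be given its own namespace dict. A small sketch, assuming the file defines a function called main (the helper name run_source is made up):

```python
def run_source(source, entry_point="main"):
    """Execute Python source in a fresh namespace and call one function defined in it."""
    namespace = {}
    exec(source, namespace)  # definitions end up in namespace, not in our globals
    return namespace[entry_point]()


# Usage with a file, following the snippet above:
# with open("filename.txt") as file:
#     run_source(file.read())
```

Keep in mind that exec runs whatever the file contains, so only use it on files you trust.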

Python: How to get the URL to a file when the file is received from a pipe?

I created, in Python, an executable whose input is the URL to a file and whose output is the file, e.g.,
file:///C:/example/folder/test.txt --> url2file --> the file
Actually, the URL is stored in a file (url.txt) and I run it from a DOS command line using a pipe:
type url.txt | url2file
That works great.
I want to create, in Python, an executable whose input is a file and whose output is the URL to the file, e.g.,
a file --> file2url --> URL
Again, I am using DOS and connecting executables via pipes:
type url.txt | url2file | file2url
Question: file2url is receiving a file. How do I get the file's URL (or path)?

In general, you probably can't.
If the url is not stored in the file, it seems very difficult to get the url. Imagine someone reads a text to you: without further information, you have no way to know what book it comes from.
However, there are certain use cases where you can do it.
Pipe the url together with the file.
If you need the url and you can do that, try to keep the url together with the file. Make url2file pipe your url first and then the file.
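One possible way to do this, assuming you control both programs: send the url as the first line and the file's bytes after it. The function names below are illustrative only.

```python
def write_with_url(url, payload, out):
    """Write the url as the first line, followed by the raw file contents."""
    out.write(url.encode("utf-8") + b"\n")
    out.write(payload)


def read_with_url(stream):
    """Split a stream written by write_with_url back into (url, payload)."""
    url = stream.readline().rstrip(b"\r\n").decode("utf-8")
    return url, stream.read()


# In url2file:  write_with_url(url, data, sys.stdout.buffer)
# In file2url:  url, data = read_with_url(sys.stdin.buffer)
```

The binary buffers (sys.stdout.buffer, sys.stdin.buffer) are used so that the file's bytes pass through the pipe unchanged.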
Restructure your pipeline
Maybe you don't need to find the url for the file, if you restructure your pipeline.
Index your files
If only certain files could potentially be piped into file2url, you could precalculate a hash for each of them and store it in your program together with the url. In Python you would do this using a dict where the key is the hash of the file's contents (as a string) and the value is the url. You could use pickle to write the dict object to a file and load it at the start of your program.
Then you could simply lookup the url from this dict.
You might want to research how databases or search functions in explorers handle indexing or alternative solutions.
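A rough sketch of such an index, assuming all candidate files live under one folder and are small enough to hash in memory. SHA-256 is my choice here; the helper names are made up.

```python
import hashlib
from pathlib import Path


def digest(data):
    """Hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def build_index(folder):
    """Map the content hash of every file under folder to its file:// url."""
    index = {}
    for path in Path(folder).rglob("*"):
        if path.is_file():
            index[digest(path.read_bytes())] = path.resolve().as_uri()
    return index


def lookup_url(index, piped_bytes):
    """Return the url of the indexed file with identical contents, or None."""
    return index.get(digest(piped_bytes))
```

In file2url the piped data would come from sys.stdin.buffer.read(). Two files with identical contents share one hash, so only one of their urls is kept.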
Searching for the file
You could take one significant line of the file and use something like grep or head on Linux to search all files on your computer for this line. Note that grep and head are programs, not Python functions. For DOS, you might need to google the equivalent programs.
FYI: grep searches files for lines that match a given text.
head prints the first few lines of a file. I suggest comparing only the first few lines of each file to avoid reading through huge files.
Searching all files on the computer might take very long.
You could restrict the search to files with the same size as your piped input.
Use url.txt
If file2url knows the location of the file url.txt, then you could look up all files in url.txt until you find a file identical to the file that was piped into your program. You could combine this with the hashing/ indexing solution.

'file2url' receives the data via standard input (like keyboard).
The data is transferred by the kernel and it doesn't necessarily have to have any file-system representation. So if there's no file there's no URL or path to that for you to get.

Let's try the obvious approach:
$ cat test.py | python test.py
import sys
print(''.join(sys.stdin.readlines()))
print(sys.stdin.name)
<stdin>
So the file name is "<stdin>": as far as Python is concerned there is no file name, only input.
Another way is system-dependent: for example, try to find the command line that was used. There is no guarantee that this will work, though.

How to separate content from a file that is a container for binary and other forms of content

I am trying to parse some .txt files. These files serve as containers for a variable number of 'children' files that are set off or identified within the container with SGML tags. With python I can easily separate the children files. However I am having trouble writing the binary content back out as a binary file (say a gif or jpg). In the simplest case the container might have an embedded html file followed by a graphic that is called by the html. I am assuming that my problem is because I am reading the original .txt file using open(filename,'r'). But that seems the only option to find the sgml tags to split the file.
I would appreciate any help to identify some relevant reading material.
I appreciate the suggestions but I am still struggling with the most basic questions. For example when I open the file with wordpad and scroll down to the section tagged as a gif I see this:
<FILENAME>h65803h6580301.gif
<DESCRIPTION>GRAPHIC
<TEXT>
begin 644 h65803h6580301.gif
M1TE&.#EA(P)I`=4#`("`#,#`P$!`0+^_OW]_?_#P\*"#H.##X-#0T&!#8!`0
M$+"PL"`#('!P<)"0D#`P,%!04#\_/^_O[Y^?GZ^OK]_?WX^/C\_/SV]O;U]?
I can find the section easily enough, but where does the gif file begin? Does the header start with 644, with the blanks after the word begin, or with the line beginning with M1TE?
Next, when the file is read into python does it do anything to the binary code that has to be undone when it is read back out?
I can find the lines where the graphics begin:
import re
with open('myfile.txt', 'rb') as filerefbin:
    wholeFile = filerefbin.read()
# bytes pattern, because the file was read in binary mode
graphicReg = re.compile(b'<DESCRIPTION>GRAPHIC')
locationGraphics = graphicReg.finditer(wholeFile)
graphicsTags = []
for match in locationGraphics:
    graphicsTags.append(match.span())
I can easily use the same process to get to the word begin, or to identify the filename and get to the end of the filename in the 'first' line. I have also successfully gotten to the end of the embedded gif file. But I can't seem to write out the correct combination of things, so that when I double-click h65803h6580301.gif after it has been isolated and saved, I get to see the graphic.
Interestingly, when I open the file in rb mode, the line endings appear to still be present, even though they don't seem to have any effect in Notepad. So that is clearly one of my problems: I might need to use readlines and join the lines together after stripping out the \n.
I love this site and I love PYTHON
This was too easy once I read bendin's post. I just had to snip the section that began with the word begin and save that in a txt file and then run the following command:
import uu
uu.decode(r'c:\test2.txt',r'c:\test.gif')
I have to work on some other stuff for the rest of the day, but I will post more here as I look at this more closely. The first thing I need to discover is how to use something other than a file: since I read the whole .txt file into memory and clipped out the section that has the image, I need to work with the clipped section instead of writing it out to test2.txt. I am sure that can be done; it's just a matter of figuring out how to do it.

What you're looking at isn't "binary", it's uuencoded. Python's standard library includes the module uu to handle uuencoded data.
The uu module works on files (or file-like objects) rather than on strings, and it was removed from the standard library in Python 3.13. You can encode and decode in memory, without resorting to temporary files, by using Python's codecs module like this:
import codecs
# The uu codec works on bytes, so the sample data is a bytes literal
data = b"Let's just pretend that this is binary data, ok?"
uuencode = codecs.getencoder("uu")
data_uu, n = uuencode(data)
uudecode = codecs.getdecoder("uu")
decoded, m = uudecode(data_uu)
print("* The initial input:")
print(data.decode("ascii"))
print("* Encoding these %d bytes produces:" % n)
print(data_uu.decode("ascii"))
print("* When we decode these %d bytes, we get the original data back:" % m)
print(decoded.decode("ascii"))
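Applied to the container file from the question, the same codec can decode each embedded section straight from memory. The following is a sketch under assumptions: Python 3, a container already read with open(..., 'rb'), and sections that run from a "begin <mode> <name>" line to a line containing only "end". The regular expression and the function name are mine.

```python
import codecs
import re

# One uuencoded section: from the "begin <mode> <name>" line to the "end" line
UU_SECTION = re.compile(rb"^begin [0-7]{3} [^\n]*\n.*?^end$", re.DOTALL | re.MULTILINE)


def extract_uu_payloads(container):
    """Return the decoded bytes of every uuencoded section in container."""
    # Normalise Windows line endings; the decoder expects lines ending in "\n"
    container = container.replace(b"\r\n", b"\n")
    return [
        codecs.decode(match.group(0) + b"\n", "uu")
        for match in UU_SECTION.finditer(container)
    ]


# with open('myfile.txt', 'rb') as handle:
#     images = extract_uu_payloads(handle.read())
# with open('test.gif', 'wb') as out:
#     out.write(images[0])
```

The decoded bytes must be written with mode 'wb' so that nothing translates them on the way out.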

You definitely need to read in binary mode if the content includes JPEG images.
Python 2 also includes an SGML parser, sgmllib (http://docs.python.org/library/sgmllib.html); the module was removed in Python 3.
There is no example there, but all you need to do is set up do_ methods to handle the SGML tags you are interested in.

You need to use open(filename,'rb') to open the file in binary mode. Be aware that binary mode turns off newline translation, so for files written on Windows you will see the two-byte \r\n line endings in the data.
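The difference is easy to see by reading the same bytes back in both modes. A small Python 3 sketch (read_both_ways is a made-up helper that uses a temporary file):

```python
import os
import tempfile


def read_both_ways(raw):
    """Write raw bytes to a temporary file, then read it back in binary and in text mode."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(raw)
        with open(path, "rb") as handle:
            as_bytes = handle.read()  # bytes exactly as stored on disk
        with open(path, "r", encoding="ascii") as handle:
            as_text = handle.read()  # text mode translates "\r\n" to "\n"
    finally:
        os.remove(path)
    return as_bytes, as_text


as_bytes, as_text = read_both_ways(b"one\r\ntwo\r\n")
print(repr(as_bytes))
print(repr(as_text))
```

For image data the binary result is the one you want, because every byte is preserved.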
