Converting the output of a module from print to write mode - python

I have been trying to use the twobitreader package (http://pythonhosted.org//twobitreader/) to extract DNA sequence information, but I have run into a problem. Whenever I use the twobitreader.twobit_reader() function, I am only able to obtain printed output. What I would like to do is write the output to a new file.
This is the information on this module from http://pythonhosted.org//twobitreader/:
twobit_reader takes a twobit_file (of class TwoBitFile) and an "input_stream", which can be any iterable (incl. file-like objects); writes output (FASTA format) using write (print if write=None); logs errors/warnings to stderr.
Likely, my limited knowledge with python programming is impeding me from accomplishing this task.
For example, here is some code that I wrote:
import twobitreader

def get_a(n):
    """get sequences from genome"""
    genome = twobitreader.TwoBitFile('hg19.2bit')
    bedfile = open(n + '.bed', 'r')
    o_f = open(n + '_FASTA.txt', 'w')
    twobitreader.twobit_reader(genome, bedfile)
    bedfile.close()
    o_f.close()
This ends up printing my sequences.
If I try to alter the twobitreader line to twobitreader.twobit_reader(genome, bedfile, o_f) in an attempt to write the data to the file o_f, I get the error: 'file' object is not callable.

The write argument must be a callable, not a file object, so pass the file's write method instead. The OP confirmed this worked:
twobitreader.twobit_reader(genome, bedfile, o_f.write)
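For completeness, the corrected function then looks like this; it is the code from the question with the single change of passing the bound method o_f.write, which twobit_reader calls for each line of output:
import twobitreader

def get_a(n):
    """get sequences from genome"""
    genome = twobitreader.TwoBitFile('hg19.2bit')
    bedfile = open(n + '.bed', 'r')
    o_f = open(n + '_FASTA.txt', 'w')
    # o_f.write is a callable, so twobit_reader writes instead of printing
    twobitreader.twobit_reader(genome, bedfile, o_f.write)
    bedfile.close()
    o_f.close()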

Related

The open() function doesn't behave correctly with filepaths containing special characters

I'm writing this simple code:
file = input('File to read: ')
fhand = open(file, 'r')
The file I want to open is called 'test.txt', and it is located in a subfolder; what I put into the requested input therefore is: 'DB\test.txt'.
Well: it doesn't work, returning this error message:
OSError: [Errno 22] Invalid argument: 'DB\test.txt'.
I have another file in the same directory, called 'my_file.txt', and I don't get errors attempting to open it. Lastly I have another file, called 'new_file.txt', and this one also gets me the same error.
What seems obvious to me is that the open() function reads the "\t" and the "\n" as if they were escape sequences; searching the web, I found nothing that could really help me avoid special characters within user input strings...
Could anybody help?
Thanks!
You'll have no problems with Python 3 and your exact code (this issue is often encountered when passing Windows literal strings, where the r prefix is required).
With Python 2, you first have to wrap your filename in quotes, and then all escape sequences (\t, \n ...) get interpreted. Unless you input r"DB\test.txt" using the raw prefix I was mentioning earlier, but it's beginning to become cumbersome :)
So I'd suggest using raw_input (and not wrapping the input in quotes). Or, as a version-agnostic alternative, override the unsafe input for Python 2 only:
try:
    input = raw_input
except NameError:
    pass
Then your code will work fine, and as a bonus you get rid of possible code injection (see the Python 2-specific topic: Is it ever useful to use Python's input over raw_input?).
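To make the behaviour concrete, here is a minimal sketch of the shim in use, with the filename from the question typed at the prompt:
try:
    input = raw_input  # Python 2: read the line literally, no eval
except NameError:
    pass               # Python 3: input() already returns a plain string

file = input('File to read: ')  # user types DB\test.txt, without quotes
fhand = open(file, 'r')         # the backslash stays literal, so open() works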

add header to stdout of a subprocess in python

I am merging several dataframes into one and sorting them using unix sort. Before I write the final sorted data I would like to add a prefix/header to that output.
So, my code is something like:
my_cols = '\t'.join(['CHROM', 'POS', "REF", ...])
my_cmd = ["sort", "-k1,2", "-V", "final_merged.txt"]
with open(output + 'mergedAndSorted.txt', 'w') as sort_data:
    sort_data.write(my_cols + '\n')
    subprocess.run(my_cmd, stdout=sort_data)
But the code above puts my_cols at the end of the final output file (i.e. mergedAndSorted.txt).
I also tried substituting:
sort_data=io.StringIO(my_cols)
but this gives me an error as I had expected.
How can I add that header to the beginning of the subprocess output? I believe this can be achieved by a simple code change.
The problem with your code is a matter of buffering; the tldr is that you can fix it like this:
sort_data.write(my_cols + '\n')
sort_data.flush()
subprocess.run(my_cmd, stdout=sort_data)
If you want to understand why it happens, and how the fix solves it:
When you open a file in text mode, you're opening a buffered file. Writes go into the buffer, and the file object doesn't necessarily flush them to disk immediately. (There's also stream-encoding from Unicode to bytes going on, but that doesn't really add a new problem, it just adds two layers where the same thing can happen, so let's ignore that.)
As long as all of your writes are to the buffered file object, that's fine—they get sequenced properly in the buffer, so they get sequenced properly on the disk.
But if you write to the underlying sort_data.buffer.raw disk file, or to the sort_data.fileno() OS file descriptor, those writes may get ahead of the ones that went to sort_data.
And that's exactly what happens when you use the file as a pipe in subprocess. This doesn't seem to be explained directly, but can be inferred from Frequently Used Arguments:
stdin, stdout and stderr specify the executed program’s standard input, standard output and standard error file handles, respectively. Valid values are PIPE, DEVNULL, an existing file descriptor (a positive integer), an existing file object, and None.
This implies pretty strongly—if you know enough about the way piping works on *nix and Windows—that it's passing the actual file descriptor/handle to the underlying OS functionality. But it doesn't actually say that. To really be sure, you have to check the Unix source and Windows source, where you can see that it is calling fileno or msvcrt.get_osfhandle on the file objects.
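Putting it all together, a minimal runnable sketch of the fixed version (the column list is truncated as in the question, and the output path is simplified):
import subprocess

my_cols = '\t'.join(['CHROM', 'POS', 'REF'])  # truncated column list
my_cmd = ["sort", "-k1,2", "-V", "final_merged.txt"]

with open('mergedAndSorted.txt', 'w') as sort_data:
    sort_data.write(my_cols + '\n')
    sort_data.flush()  # drain Python's buffer before sort writes to the same fd
    subprocess.run(my_cmd, stdout=sort_data)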

Unicode issues with tarfile.extractall() (Python 2.7)

I'm using Python 2.7.6 on Windows and I'm using the tarfile module to extract a gzipped tar file. The mode option of tarfile.open() is set to "r:gz". After the open call, if I print the contents of the archive via tarfile.list(), I see the following directory in the list:
./静态分析 Part 1.v1/
However, after I call tarfile.extractall(), I don't see the above directory in the extracted list of files; instead I see this:
é™æ€åˆ†æž Part 1.v1/
If I were to extract the archive via 7zip, I see a directory with the same name as the first item above. So, clearly, the extractall() method is screwing up, but I don't know how to fix this.
I learned that tar doesn't retain encoding information as part of the archive and treats filenames as raw byte sequences. So, the output I saw from tarfile.extractall() was simply the raw byte sequence that comprised the file's name prior to compression. To get the extractall() method to recreate the original filenames, I discovered that you have to manually convert the members of the TarFile object to the appropriate encoding before calling extractall(). In my case, the following did the trick:
modeltar = tarfile.open(zippath, mode="r:gz")
updatedMembers = []
for m in modeltar.getmembers():
    m.name = unicode(m.name, 'utf-8')
    updatedMembers.append(m)
modeltar.extractall(members=updatedMembers, path=dbpath)
The above code is based on this superuser answer: https://superuser.com/a/190786/354642
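The same fix, wrapped as a small reusable helper; this is a sketch assuming Python 2 and UTF-8-encoded member names (the function name is mine):
import tarfile

def extract_utf8(archive_path, dest_path):
    # Extract a .tar.gz whose member names are raw UTF-8 byte strings.
    tf = tarfile.open(archive_path, mode="r:gz")
    members = tf.getmembers()
    for m in members:
        m.name = unicode(m.name, 'utf-8')  # decode raw bytes to a unicode name
    tf.extractall(members=members, path=dest_path)
    tf.close()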

Python: how to capture output to a text file? (only 25 of 530 lines captured now)

I've done a fair amount of lurking on SO and a fair amount of searching and reading, but I must also confess to being a relative noob at programming in general. I am trying to learn as I go, and so I have been playing with Python's NLTK. In the script below, I can get everything to work, except it only writes what would be the first screen of a multi-screen output, at least that's how I am thinking about it.
Here's the script:
#! /usr/bin/env python
import nltk
# First we have to open and read the file:
thefile = open('all_no_id.txt')
raw = thefile.read()
# Second we have to process it with nltk functions to do what we want
tokens = nltk.wordpunct_tokenize(raw)
text = nltk.Text(tokens)
# Now we can actually do stuff with it:
concord = text.concordance("cultural")
# Now to save this to a file
fileconcord = open('ccord-cultural.txt', 'w')
fileconcord.writelines(concord)
fileconcord.close()
And here's the beginning of the output file:
Building index...
Displaying 25 of 530 matches:
y .   The Baobab Tree : Stories of Cultural Continuity The continuity evident
regardless of ethnicity , and the cultural legacy of Africa as well . This Af
What am I missing here to get the entire 530 matches written to the file?
text.concordance(self, word, width=79, lines=25) seems to have other parameters, as per the manual.
I see no way to extract the size of the concordance index; however, the concordance printing code seems to contain this part: lines = min(lines, len(offsets)), so you can simply pass sys.maxint as the last argument:
concord = text.concordance("cultural", 75, sys.maxint)
Added:
Looking at your original code now, I can't see how it could have worked before. text.concordance does not return anything, but outputs everything to stdout using print. Therefore, the easy option would be redirecting stdout to your file, like this:
import sys
....
# Open the file
fileconcord = open('ccord-cultural.txt', 'w')
# Save old stdout stream
tmpout = sys.stdout
# Redirect all "print" calls to that file
sys.stdout = fileconcord
# Init the method
text.concordance("cultural", 200, sys.maxint)
# Close file
fileconcord.close()
# Reset stdout in case you need something else to print
sys.stdout = tmpout
Another option would be to use the respective classes directly and omit the Text wrapper. Just copy bits from here and combine them with bits from here and you are done.
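For instance, here is a minimal sketch of that second option, assuming nltk's ConcordanceIndex class (its offsets() method returns the token positions of a word); the 10-token context window is my own choice, and tokens is the list from the original script:
from nltk.text import ConcordanceIndex

ci = ConcordanceIndex(tokens, key=lambda s: s.lower())
with open('ccord-cultural.txt', 'w') as out:
    for offset in ci.offsets('cultural'):
        # reassemble a context window around each match
        left = ' '.join(tokens[max(0, offset - 10):offset])
        right = ' '.join(tokens[offset + 1:offset + 11])
        out.write('%s %s %s\n' % (left, tokens[offset], right))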
Update:
I found the thread write text.concordance output to a file in the nltk user group. It's from 2010, and states:
Documentation for the Text class says: "is intended to support
initial exploration of texts (via the interactive console). ... If you
wish to write a program which makes use of these analyses, then you
should bypass the Text class, and use the appropriate analysis
function or class directly instead."
If nothing has changed in the package since then, this may be the source of your problem.
--- previously ---
I don't see a problem with writing to the file using writelines():
file.writelines(sequence)
Write a sequence of strings to the file. The sequence can be any
iterable object producing strings, typically a list of strings. There
is no return value. (The name is intended to match readlines();
writelines() does not add line separators.)
Note the last part: writelines() does not add line separators. Did you examine the output file in different editors? Perhaps the data is there, but not being rendered correctly due to missing end-of-line separators?
Are you sure this part is generating the data you want to output?
concord = text.concordance("cultural")
I'm not familiar with nltk, so I'm just asking as part of eliminating possible sources for the problem.

Embed pickle (or arbitrary) data in python script

In Perl, the interpreter kind of stops when it encounters a line with
__END__
in it. This is often used to embed arbitrary data at the end of a Perl script. In this way the script can fetch data that it stores 'in itself', which allows for some quite nice possibilities.
In my case I have a pickled object that I want to store somewhere. While I can use a file.pickle file just fine, I was looking for a more compact approach (to distribute the script more easily).
Is there a mechanism that allows for embedding arbitrary data inside a python script somehow?
With pickle you can also work directly on strings.
s = pickle.dumps(obj)
obj = pickle.loads(s)
If you combine that with """ (triple-quoted strings) you can easily store any pickled data in your file.
If the data is not particularly large (many KB), I would just .encode('base64') it and include that in a triple-quoted string, with .decode('base64') to get back the binary data, and a pickle.loads() call around it.
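A minimal round-trip sketch of that idea, assuming Python 2 (where str.encode('base64') exists as a codec); in a real script you would paste the encoded text in as a literal:
import pickle

obj = {'answer': 42}
encoded = pickle.dumps(obj).encode('base64')   # generate this once

EMBEDDED = """%s""" % encoded                  # the literal your script carries
restored = pickle.loads(EMBEDDED.decode('base64'))
assert restored == obj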
In Python, you can use """ (triple-quoted strings) to embed long runs of text data in your program.
In your case, however, don't waste time on this.
If you have an object you've pickled, you'd be much, much happier dumping that object as Python source and simply including the source.
The repr function, applied to most objects, will emit a Python source-code version of the object. If you implement __repr__ for all of your custom classes, you can trivially dump your structure as Python source.
If, on the other hand, your pickled structure started out as Python code, just leave it as Python code.
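A tiny sketch of that approach with a built-in structure (the file and variable names are mine):
# Dump the structure as Python source once...
data = {'name': 'hg19', 'chroms': ['chr1', 'chr2']}
with open('embedded_data.py', 'w') as f:
    f.write('DATA = %r\n' % (data,))

# ...then ship embedded_data.py alongside (or inside) your script and do:
# from embedded_data import DATA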
I made this code. You run something like python comp.py foofile.tar.gz, and it creates decomp.py with foofile.tar.gz's contents embedded in it. I don't think this is really portable to Windows, though, because of the Popen calls.
import base64
import sys
import subprocess

inf = open(sys.argv[1], "r+b").read()
outs = base64.b64encode(inf)

decomppy = '''#!/usr/bin/python
import base64

def decomp(data):
    fname = "%s"
    outf = open(fname, "w+b")
    outf.write(base64.b64decode(data))
    outf.close()

# You can put the rest of your code here.
# Like this, to unzip an archive:
# import subprocess
# subprocess.Popen("tar xzf " + fname, shell=True)
# subprocess.Popen("rm " + fname, shell=True)
''' % (sys.argv[1])

taildata = '''uudata = """%s"""
decomp(uudata)
''' % (outs)

outpy = open("decomp.py", "w+b")
outpy.write(decomppy)
outpy.write(taildata)
outpy.close()
subprocess.Popen("chmod +x decomp.py", shell=True)
