Extracting a text file from tar with tarfile module in python3 - python

Is there a simple way to extract a text file from a tar archive as a text-mode file object in Python 3.4 or later?
I am porting my Python 2 code to Python 3, and I found that TarFile.extractfile, which used to return a file object with text I/O, now returns an io.BufferedReader object, which does binary I/O. The rest of my code expects text I/O, and I need to absorb this change somehow.
One method I can think of is to use TarFile.extract to write the file to a directory and then open it with the open function, but I wonder if there is a way to get a text I/O stream directly.

Try io.TextIOWrapper to wrap the io.BufferedReader.
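For example, a minimal sketch (the archive and member names here are made up for illustration):
import io
import tarfile

with tarfile.open("example.tar") as tar:
    # extractfile() returns a binary file object (io.BufferedReader)
    binary = tar.extractfile("some/member.txt")
    # wrap it to get a text-mode stream; choose the encoding your data uses
    text = io.TextIOWrapper(binary, encoding="utf-8")
    for line in text:
        print(line, end="")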

You can use getmembers() to list the contents of the archive:
import tarfile
tar = tarfile.open("test.tar")
tar.getmembers()
After that, you can use extractfile() to extract each member as a file object. Just an example:
import tarfile

tar = tarfile.open("test.tar")
for member in tar.getmembers():
    f = tar.extractfile(member)
    if f is not None:  # extractfile() returns None for non-file members
        content = f.read()
        # do operations with your content here
tar.close()

Related

How to read and change BAM files from a Python script?

I'm planning on using a Python script to change various BAM (Binary Alignment Map) file headers. Right now I am just testing the output on one BAM file, but every time I want to check my output, what reaches stdout is not human-readable. How can I see the output of my script? Should I run samtools view on the BAM file from my script? This is my code:
#!/usr/bin/env python
import os
import subprocess

if __name__ == '__main__':
    for file in os.listdir(os.getcwd()):
        if file == "SRR4209928.bam":
            with open("SRR4209928.bam", "r") as input:
                content = input.readlines()
                for line in content:
                    print(line)
Since BAM is a binary type of SAM, you will need to write something that knows how to deal with the compressed data before you can extract something meaningful from it. Unfortunately, you can't just open() and readlines() from that type of file.
If you are going to write a module by yourself, you will need to read Sequence Alignment/Map Format Specification.
Fortunately, someone already did that and created a Python module: go ahead and check out pysam. It will surely make your life easier.
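For instance, a minimal sketch with pysam (assuming it is installed; the file name is the one from the question):
import pysam

# "rb" means read a binary (BAM) file
bamfile = pysam.AlignmentFile("SRR4209928.bam", "rb")

# the header is exposed as a structured object instead of raw bytes
print(bamfile.header)

# each record prints as a human-readable, SAM-like line
for read in bamfile:
    print(read)

bamfile.close()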
I hope it helps.

Piping from Python's ftplib without blocking

Ideally what I'd like to do is replicate this bash pipeline in python (I'm using cut here to represent some arbitrary transformation of the data. I actually want to use pandas to do this):
curl ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/refFlat.txt.gz | gunzip | cut -f 1,2,4
I can write the following code in Python, which achieves the same goal:
import ftplib
import gzip
import io
import pandas as pd

# Download the zip file into memory
file = io.BytesIO()
ftp = ftplib.FTP('hgdownload.cse.ucsc.edu')
ftp.retrbinary(f'RETR goldenPath/{args.reference}/database/refFlat.txt.gz', file.write)
# Unzip the gzip file
file.seek(0)  # rewind the buffer before decompressing
table = gzip.GzipFile(fileobj=file)
# Read into pandas
df = pd.read_csv(table)
However, the ftp.retrbinary() call blocks, and waits for the whole download. What I want is to have one long binary stream, with the FTP file as the source, with a gunzip as a filter, and with pd.read_csv() as a sink, all simultaneously processing data, as in my bash pipeline. Is there a way to stop retrbinary() from blocking?
I realise this may be impossible because Python can't run more than one thread at a time. Is this true? If so, can I use multiprocessing, async, or some other language feature to achieve this simultaneous pipeline?
edit: changed storbinary to retrbinary, this was a typo and the problem still stands
You should be able to process the download as it arrives instead of buffering the whole file first. Note, though, that gzip.GzipFile() cannot be constructed without a file object, and writing to a GzipFile compresses data rather than decompressing it, so passing gzipfile.write as the retrbinary callback will not gunzip the stream. One way to get a genuine pipeline is to run the blocking retrbinary() call in a background thread and connect it to the decompressor and parser through a pipe.
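A minimal sketch of that idea (an assumption on my part, not tested here; it uses hg38 as in the bash example and anonymous FTP login):
import ftplib
import gzip
import os
import threading

import pandas as pd

ftp = ftplib.FTP('hgdownload.cse.ucsc.edu')
ftp.login()  # anonymous

read_fd, write_fd = os.pipe()

def download():
    # retrbinary blocks in this thread; each received chunk goes into the pipe
    with os.fdopen(write_fd, 'wb') as sink:
        ftp.retrbinary('RETR goldenPath/hg38/database/refFlat.txt.gz', sink.write)

thread = threading.Thread(target=download)
thread.start()

# Meanwhile, the main thread decompresses and parses as data arrives;
# the GIL is released during I/O, so both threads make progress.
with os.fdopen(read_fd, 'rb') as source:
    table = gzip.GzipFile(fileobj=source)
    # usecols=[0, 1, 3] mirrors cut -f 1,2,4 from the bash pipeline
    df = pd.read_csv(table, sep='\t', header=None, usecols=[0, 1, 3])

thread.join()
ftp.quit()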

How to execute a python file using txt file as input (to parse data)

I tried looking on Stack Overflow and in other sources, but could not find a solution.
I am trying to run/execute a Python script (that parses the data) using a text file as input.
How do I go about doing it?
Thanks in advance.
These basics can be found using google :)
http://pythoncentral.io/execute-python-script-file-shell/
http://www.python-course.eu/python3_execute_script.php
Since you are new to Python, make sure that you read Python For Beginners.
Sample code Read.py:
import sys

with open(sys.argv[1], 'r') as f:
    contents = f.read()
    print(contents)
To execute this program in Windows:
C:\Users\Username\Desktop>python Read.py sample.txt
You can also save the input in the desired (line-wise) format and redirect it to the script's standard input, as an alternative to passing the file name as an argument:
python source.py < input.txt
Note: to use input or source files in any other directory, use the complete file location instead of just the file name.
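A minimal sketch of the redirection approach (the script and file names are made up for illustration):
# parse_stdin.py -- reads the redirected file from standard input
import sys

for line in sys.stdin:
    # replace this with your actual parsing logic
    print(line.rstrip())
Run it as python parse_stdin.py < input.txt.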

Writing PDFs to STDOUT with Python

I want to merge two PDF documents with Python (prepend a pre-made cover sheet to an existing document) and present the result to a browser. I'm currently using the PyPDF2 library, which can perform the merge easily enough, but the PdfFileWriter class's write() method only seems to support writing to a file object (it must support the write() and tell() methods). In this case there is no reason to touch the filesystem: the merged PDF is already in memory, and I just want to send a Content-type header and then the document to STDOUT (to the browser via CGI). Is there a Python library better suited to writing a document to STDOUT than PyPDF2? Alternately, is there a way to pass STDOUT as an argument to PdfFileWriter's write() method in such a way that it appears to write() as though it were a file handle?
Letting write() write the document to the filesystem and then opening the resulting file and sending it to the browser works, but is not an option in this case (aside from being terribly inelegant).
Solution
Using mgilson's advice, this is how I got it to work in Python 2.7:
#!/usr/bin/python
import cStringIO
import sys
from PyPDF2 import PdfFileMerger
merger = PdfFileMerger()
###
# Actual PDF open/merge code goes here
###
output = cStringIO.StringIO()
merger.write(output)
print("Content-type: application/pdf\n")
sys.stdout.write(output.getvalue())
output.close()
Python supports an "in-memory" file type via cStringIO.StringIO (or io.BytesIO, depending on the Python version). In your case, you create an instance of one of those classes, pass it to the method that expects a file, and then use the .getvalue() method to get the contents back as a string (or bytes, depending on the version). Once you have the contents as a string, you can simply print them or use sys.stdout.write to write them to standard output.
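In Python 3, the same pattern would presumably use io.BytesIO and sys.stdout.buffer for the binary payload (a sketch, untested against a real CGI setup):
#!/usr/bin/env python3
import io
import sys
from PyPDF2 import PdfFileMerger  # PdfFileMerger exists in PyPDF2 versions before 3.0

merger = PdfFileMerger()
###
# Actual PDF open/merge code goes here
###
output = io.BytesIO()
merger.write(output)

# Headers are text; the PDF body must go out as raw bytes.
sys.stdout.write("Content-type: application/pdf\n\n")
sys.stdout.flush()
sys.stdout.buffer.write(output.getvalue())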

Why do scripts behave differently when called from the command line vs. git attributes?

Updated scripts are attached below; these now work on my sample document.
Why do the following Python scripts perform differently when called via git attributes than when run from the command line?
What I have are two scripts that I modeled on Mercurial's zipdoc functionality. All I'm attempting to do is unzip docx files on store (filter.clean) and zip them on restore (filter.smudge). The two scripts work well on their own, but once I wire them into git attributes they don't work, and I don't understand why.
I've tested by doing the following
Testing the Scripts (git bash)
$ cat original.docx | python ~/Documents/pyscripts/unzip.py > uncompress.docx
$ cat uncompress.docx | python ~/Documents/pyscripts/zip.py > compress.docx
$ md5sum uncompress.docx compress.docx
I can open both the uncompressed and compressed files with Microsoft Word with no errors. The scripts work as expected.
Test Git Attributes
I set both clean and smudge to cat, and verified that my file saves and restores without problems.
I set clean to python ~/Documents/pyscripts/unzip.py. After a commit and checkout, the file is now larger (uncompressed) but errors when opened in MS Word. The md5 also does not match the "script only" test above, although the file size is identical.
I set clean back to cat and set smudge to python ~/Documents/pyscripts/zip.py. After a commit and checkout, the file is now smaller (compressed) but again errors when opened in MS Word. Again, the md5 differs from the "script only" test but the file size matches.
Setting both clean and smudge to the Python scripts produces an error, as expected.
I'm really lost here. I thought git attributes simply feed the file content to the filter on stdin and read the result back from stdout; both scripts work fine with a pipe from cat and a redirect of the output. I know the scripts are running because the files change size as expected, yet a small change is introduced somewhere in the file.
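For reference, the wiring being described would presumably look something like this (the filter name zipdoc is made up; the script paths are the ones used above):
# .gitattributes
*.docx filter=zipdoc

# .git/config (or set with git config)
[filter "zipdoc"]
    clean = python ~/Documents/pyscripts/unzip.py
    smudge = python ~/Documents/pyscripts/zip.py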
Additional References
I'm using msysGit on Win7; all commands above were typed into the Git Bash window.
Git Attributes Description
Uncompress Script
import sys
import zipfile

# Set stdin and stdout to binary read/write
if sys.platform == "win32":
    import os, msvcrt
    msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
    msvcrt.setmode(sys.stdin.fileno(), os.O_BINARY)

try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO

# Wrap stdin into a file-like object
inString = StringIO(sys.stdin.read())
outString = StringIO()

# Store each member uncompressed
try:
    with zipfile.ZipFile(inString, 'r') as inFile:
        outFile = zipfile.ZipFile(outString, 'w', zipfile.ZIP_STORED)
        for memberInfo in inFile.infolist():
            member = inFile.read(memberInfo)
            memberInfo.compress_type = zipfile.ZIP_STORED  # 0 = no compression
            outFile.writestr(memberInfo, member)
        outFile.close()
except zipfile.BadZipfile:
    # Not a zip file: pass the input through unchanged
    sys.stdout.write(inString.getvalue())

sys.stdout.write(outString.getvalue())
Compress Script
import sys
import zipfile

# Set stdin and stdout to binary read/write
if sys.platform == "win32":
    import os, msvcrt
    msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
    msvcrt.setmode(sys.stdin.fileno(), os.O_BINARY)

try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO

# Wrap stdin into a file-like object
inString = StringIO(sys.stdin.read())
outString = StringIO()

# Store each member compressed
try:
    with zipfile.ZipFile(inString, 'r') as inFile:
        outFile = zipfile.ZipFile(outString, 'w', zipfile.ZIP_DEFLATED)
        for memberInfo in inFile.infolist():
            member = inFile.read(memberInfo)
            memberInfo.compress_type = zipfile.ZIP_DEFLATED
            outFile.writestr(memberInfo, member)
        outFile.close()
except zipfile.BadZipfile:
    # Not a zip file: pass the input through unchanged
    sys.stdout.write(inString.getvalue())

sys.stdout.write(outString.getvalue())
Edit 2: Scripts updated to write to memory rather than stdout during file processing.
I've found that opening zipfile.ZipFile() directly on stdout was causing the problem. Opening the zip with a StringIO() as the target, and only writing the full StringIO buffer to stdout at the end, solved it.
I haven't tested this extensively, and it's possible some .docx contents won't be handled well, but only time will tell. My test files now open without error, and as a bonus the .docx file in your working directory is smaller, due to using higher compression than the standard .docx format.
I have confirmed that after several edits and commits on a .docx file, I can open the file, edit one line, and commit without a large delta being added to the repo size. For example, a 19KB file with 3 previous edits in the repo history had a new line added at the top, and this created a delta of only 1KB after garbage collection. Running the same test (as close as I could) with Mercurial resulted in a 9.3KB delta commit. I'm no Mercurial expert, but my understanding is that there is no "gc" command for Mercurial, so none was run.
