I have the following code, but obviously this is not real streaming. It is the best I could find, but it reads the whole input file into memory first. I want to stream it to the tarfile module without using all my memory when decrypting huge (>100 GB) files.
import io, tarfile, gnupg

gpg = gnupg.GPG(gnupghome='C:/Users/niels/.gnupg')
with open('103330-013.tar.gpg', 'rb') as input_file:
    decrypted_data = gpg.decrypt(input_file.read(), passphrase='aaa')
# decrypted_data.data contains the decrypted bytes
decrypted_stream = io.BytesIO(decrypted_data.data)
tar = tarfile.open(fileobj=decrypted_stream, mode='r|')
tar.extractall()
tar.close()
Apparently, you cannot do real streaming with the gnupg module; it always reads the whole output of gpg into memory.
So to get real streaming, you have to run the gpg program directly.
Here is some sample code (without proper error handling):
import subprocess
import tarfile

with open('103330-013.tar.gpg', 'rb') as input_file:
    gpg = subprocess.Popen(('gpg', '--decrypt', '--homedir', 'C:/Users/niels/.gnupg', '--passphrase', 'aaa'),
                           stdin=input_file, stdout=subprocess.PIPE)
    tar = tarfile.open(fileobj=gpg.stdout, mode='r|')
    tar.extractall()
    tar.close()
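If you do want basic error handling on top of that, here is a minimal sketch along the same lines (same hypothetical filename, homedir, and passphrase as above) that surfaces gpg failures instead of silently ignoring them:

import subprocess
import tarfile

with open('103330-013.tar.gpg', 'rb') as input_file:
    gpg = subprocess.Popen(('gpg', '--decrypt', '--homedir', 'C:/Users/niels/.gnupg', '--passphrase', 'aaa'),
                           stdin=input_file, stdout=subprocess.PIPE)
    try:
        with tarfile.open(fileobj=gpg.stdout, mode='r|') as tar:
            tar.extractall()
    finally:
        gpg.stdout.close()  # lets gpg see a broken pipe if extraction stopped early
        if gpg.wait() != 0:
            raise RuntimeError('gpg exited with status %d' % gpg.returncode)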
I have a large file s3://my-bucket/in.tsv.gz that I would like to load and process, then write its processed version back to an S3 output file s3://my-bucket/out.tsv.gz.
How do I stream in.tsv.gz directly from S3 without loading the whole file into memory (it cannot fit in memory)?
How do I write the processed gzipped stream directly to S3?
In the following code, I show how I was thinking of loading the input gzipped dataframe from S3, and how I would write the .tsv if it were located locally (bucket_dir_local = './').
import pandas as pd
import s3fs
import os
import gzip
import csv
import io

bucket_dir = 's3://my-bucket/annotations/'
df = pd.read_csv(os.path.join(bucket_dir, 'in.tsv.gz'), sep='\t', compression="gzip")

bucket_dir_local = './'
# not sure how to do it with an s3 path
with gzip.open(os.path.join(bucket_dir_local, 'out.tsv.gz'), "w") as f:
    with io.TextIOWrapper(f, encoding='utf-8') as wrapper:
        w = csv.DictWriter(wrapper, fieldnames=['test', 'testing'], extrasaction="ignore")
        w.writeheader()
        for index, row in df.iterrows():
            my_dict = {"test": index, "testing": row[6]}
            w.writerow(my_dict)
Edit: smart_open looks like the way to go.
Here is a dummy example that reads a file from S3 and writes it back to S3 using smart_open:
from smart_open import open
import os

bucket_dir = "s3://my-bucket/annotations/"

with open(os.path.join(bucket_dir, "in.tsv.gz"), "rb") as fin:
    with open(os.path.join(bucket_dir, "out.tsv.gz"), "wb") as fout:
        for line in fin:
            l = [i.strip() for i in line.decode().split("\t")]
            string = "\t".join(l) + "\n"
            fout.write(string.encode())
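Building on that, here is a sketch of the actual processing step with smart_open (this assumes smart_open's transparent .gz compression based on the file extension, and reuses the made-up bucket layout and columns from the pandas snippet above):

from smart_open import open
import csv

bucket_dir = "s3://my-bucket/annotations/"

# smart_open decompresses in.tsv.gz and recompresses out.tsv.gz on the fly,
# so only one line at a time is held in memory
with open(bucket_dir + "in.tsv.gz", "r", encoding="utf-8") as fin, \
     open(bucket_dir + "out.tsv.gz", "w", encoding="utf-8") as fout:
    reader = csv.reader(fin, delimiter="\t")
    writer = csv.DictWriter(fout, fieldnames=["test", "testing"], delimiter="\t", extrasaction="ignore")
    writer.writeheader()
    for index, row in enumerate(reader):
        writer.writerow({"test": index, "testing": row[6]})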
For downloading, you can stream the S3 object directly in Python. I'd recommend reading that entire post, but here are some key lines from it:
import boto3
import gzip

s3 = boto3.client('s3', aws_access_key_id='mykey', aws_secret_access_key='mysecret')  # your authentication may vary
obj = s3.get_object(Bucket='my-bucket', Key='my/precious/object')

body = obj['Body']
with gzip.open(body, 'rt') as gf:
    for ln in gf:
        process(ln)
Unfortunately, S3 doesn't support true streaming input, but this SO answer has an implementation that chunks out the file and uploads each chunk to S3. While not a "true stream", it will let you upload large files without needing to keep the entire thing in memory.
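As a rough sketch of that chunked approach using boto3's multipart-upload calls (the bucket and key names are made up, and every part except the last must be at least 5 MB):

import boto3

s3 = boto3.client('s3')  # your authentication may vary
bucket, key = 'my-bucket', 'out.tsv.gz'

# source_stream stands in for whatever file-like object yields the processed
# data, e.g. the gzip stream from the snippet above
source_stream = s3.get_object(Bucket='my-bucket', Key='in.tsv.gz')['Body']

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
try:
    part_number = 1
    while True:
        chunk = source_stream.read(8 * 1024 * 1024)  # 8 MB chunks, comfortably above the 5 MB minimum
        if not chunk:
            break
        resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                              PartNumber=part_number, Body=chunk)
        parts.append({'ETag': resp['ETag'], 'PartNumber': part_number})
        part_number += 1
    s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                                 MultipartUpload={'Parts': parts})
except Exception:
    s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu['UploadId'])
    raise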
I'm exploring file compression options, and am confused by the behavior of the gzip module in Python. I can write a gzipped file like this:
with gzip.open('test.txt.gz', 'wb') as out:
    for i in range(100):
        out.write(bytes(i))
But if I then run gunzip test.txt.gz the output (test.txt) is still binary. What am I missing?
Ah, this works properly in Python 2.7:
import gzip

with gzip.open('test.txt.gz', 'wb') as out:
    for i in range(100):
        out.write(bytes(i))
In Python 3, we have to do:
import io, gzip

with gzip.open('test.txt.gz', 'wb') as output:
    with io.TextIOWrapper(output, encoding='utf-8') as writer:
        for i in range(100):
            writer.write(str(i))
While the code you posted for 2.7 works fine, a simpler way to fix this for 3.x would be:
import gzip

with gzip.open('test.txt.gz', 'wb') as out:
    for i in range(100):
        out.write(str(i).encode("utf-8"))
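For what it's worth, Python 3's gzip.open also accepts text modes directly, which avoids both the TextIOWrapper and the explicit encode; a small sketch:

import gzip

# 'wt' opens the gzip stream in text mode; gzip handles the UTF-8 encoding itself
with gzip.open('test.txt.gz', 'wt', encoding='utf-8') as out:
    for i in range(100):
        out.write(str(i))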
I don't know how to unzip a .gz file in Python using subprocess.
The gzip library is so slow, and I was thinking of reimplementing the function below using GNU/Linux shell commands and the subprocess library.
def __unzipGz(filePath):
    import gzip
    import os
    inputFile = gzip.GzipFile(filePath, 'rb')
    stream = inputFile.read()
    inputFile.close()
    outputFile = open(os.path.splitext(filePath)[0], 'wb')
    outputFile.write(stream)
    outputFile.close()
You can use something like this:
import subprocess
filename = "some.gunzip.file.tar.gz"
output = subprocess.Popen(['tar', '-xzf', filename])
Since there isn't much useful output here, you could also use os.system instead of subprocess.Popen, like this:
import os
filename = "some.gunzip.file.tar.gz"
exit_code = os.system("tar -xzf {}".format(filename))
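If the file is a plain .gz rather than a tarball, the same idea works with gzip itself; here is a minimal sketch mirroring the original function (the function name is just illustrative):

import os
import subprocess

def unzip_gz(file_path):
    # gzip -dc streams the decompressed data to stdout; redirecting stdout to
    # the output file means nothing is buffered in Python's memory
    output_path = os.path.splitext(file_path)[0]
    with open(output_path, 'wb') as output_file:
        subprocess.check_call(['gzip', '-dc', file_path], stdout=output_file)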
I'm extracting a tarball using the tarfile module of Python. I don't want the extracted files to be written to disk, but rather piped directly to another program, specifically bgzip. I'm also trying to use StringIO for that matter, but I get stuck even at that stage: the tarball gets extracted to disk.
#!/usr/bin/env python
import tarfile, StringIO

tar = tarfile.open("6genomes.tgz", "r:gz")

def enafun(members):
    for tarkati in tar:
        if tarkati.isreg():
            yield tarkati

reles = StringIO.StringIO()
reles.write(tar.extractall(members=enafun(tar)))
tar.close()
How then do I pipe correctly the output of tar.extractall?
You cannot use the extractall() method, but you can use the getmembers() and extractfile() methods instead:
#!/usr/bin/env python
import tarfile, StringIO

reles = StringIO.StringIO()
with tarfile.open("6genomes.tgz", "r:gz") as tar:
    for m in tar.getmembers():
        if m.isreg():
            reles.write(tar.extractfile(m).read())
# do what you want with "reles".
According to the documentation, the extractfile() method can take a TarInfo and will return a file-like object. You can then get the content of that file with read().
[EDIT] I'm adding what you asked for in the comments here, as formatting in comments does not render properly.
#!/usr/bin/env python
import tarfile
import subprocess

with tarfile.open("6genomes.tgz", "r:gz") as tar:
    for m in tar.getmembers():
        if m.isreg():
            f = tar.extractfile(m)
            new_filename = generate_new_filename(f.name)
            with open(new_filename, 'wb') as new_file:
                proc = subprocess.Popen(['bgzip', '-c'], stdin=subprocess.PIPE, stdout=new_file)
                proc.stdin.write(f.read())
                proc.stdin.close()
                proc.wait()
            f.close()
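If individual members are large, f.read() still buffers each one fully in memory; shutil.copyfileobj can feed bgzip in fixed-size chunks instead. A small variation on the loop above (same hypothetical generate_new_filename helper):

import shutil
import subprocess
import tarfile

with tarfile.open("6genomes.tgz", "r:gz") as tar:
    for m in tar.getmembers():
        if m.isreg():
            f = tar.extractfile(m)
            new_filename = generate_new_filename(f.name)
            with open(new_filename, 'wb') as new_file:
                proc = subprocess.Popen(['bgzip', '-c'], stdin=subprocess.PIPE, stdout=new_file)
                shutil.copyfileobj(f, proc.stdin)  # copies in chunks instead of one big read()
                proc.stdin.close()
                proc.wait()
            f.close()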
I am using Docker python client API 'copy'.
Response from copy is of type requests.packages.urllib3.HTTPResponse
Does it need to be handled differently for different types of file?
I copied a text file from a container, but when I try to read it using
response.read(), I get text data mixed with binary data.
I see the content decoders as:
>>> response.CONTENT_DECODERS
['gzip', 'deflate']
What is the best way to handle/read/dump the response from the copy API?
The response from the docker API is an uncompressed tar file. I had to read docker's source code to know the format of the response, as this is not documented. For instance, to download a file at remote_path, you need to do the following:
import tarfile, StringIO, os
reply = docker.copy(container, remote_path)
filelike = StringIO.StringIO(reply.read())
tar = tarfile.open(fileobj = filelike)
file = tar.extractfile(os.path.basename(remote_path))
print file.read()
The code should be modified to work on folders.
Here is my Python 3 version; with Docker API 1.38, the copy API seems to have been replaced by get_archive.
import io
import tarfile

# "client" is the Docker SDK object that exposes get_archive()
archive, stat = client.get_archive(path)
filelike = io.BytesIO(b"".join(b for b in archive))
tar = tarfile.open(fileobj=filelike)
fd = tar.extractfile(stat['name'])
Adjusting @Apr's answer for Python 3:
import tarfile, io, os

def copy_from_docker(client, container_id, src, dest):
    reply = client.copy(container_id, src)
    filelike = io.BytesIO(reply.read())
    tar = tarfile.open(fileobj=filelike)
    file = tar.extractfile(os.path.basename(src))
    with open(dest, 'wb') as f:
        f.write(file.read())
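Called, for example, like this (the container name and paths are made up, and this assumes a low-level client that still exposes the deprecated copy() call):

import docker

client = docker.APIClient()  # low-level API client; copy() only exists on older API versions
copy_from_docker(client, 'my_container', '/etc/hostname', './hostname')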