Unpack nested tar files in s3 in streaming fashion - python

I've got a large tar file in s3 (10s of GBs). It contains a number of tar.gz files.
I can loop through the contents of the large file with something like
import boto3
import tarfile

s3_client = boto3.client('s3')
input = s3_client.get_object(Bucket=bucket, Key=key)
with tarfile.open(fileobj=input['Body'], mode='r|') as tar:
    for item in tar:
        print(item)  # each item is a TarInfo
However I can't seem to open the file contents from the inner tar.gz file.
I want to be able to do this in a streaming manner rather than load the whole file into memory.
I've tried doing things like
tar.extractfile(tar.next())
But I'm not sure how this file-like object can then be read.
--- EDIT
I've got slightly further with the help of @larsks.
with tarfile.open(fileobj=input_tar_file['Body'], mode='r|') as tar:
    for item in tar:
        m = tar.extractfile(item)
        if m is not None:
            with tarfile.open(fileobj=m, mode='r|gz') as gz:
                for data in gz:
                    d = gz.extractfile(data)
However, if I call .read() on d, it is empty. If I dig into d.raw.fileobj.read() there is data, but when I write that out it is the data from all of the text files in the nested tar.gz rather than one file at a time.

The return value of tar.extractfile is a "file-like object", just like input['Body']. That means you can simply pass that to tarfile.open. Here's a simple example that prints the contents of a nested archive:
import tarfile

with open('outside.tar', 'rb') as fd:
    with tarfile.open(fileobj=fd, mode='r') as outside:
        for item in outside:
            with outside.extractfile(item) as inside:
                with tarfile.open(fileobj=inside, mode='r') as inside_tar:
                    for item in inside_tar:
                        data = inside_tar.extractfile(item)
                        print('content:', data.read())
Here the "outside" file is an actual file, rather than something
coming from an S3 bucket; but I'm opening it first so that we're still
passing in fileobj when opening the outside archive.
The code iterates through the contents of the outside archive (for item in outside) and, for each of these items:
1) Opens the file using outside.extractfile()
2) Passes that as the fileobj argument to tarfile.open
3) Extracts each item inside the nested tarfile
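The same pattern should translate directly to the S3 stream from the question. Here is a minimal sketch, assuming the bucket/key from the question and that every regular member of the outer tar is itself a tar.gz (both assumptions, adjust as needed):
import boto3
import tarfile

bucket, key = 'my-bucket', 'outer.tar'  # placeholders for the object from the question
s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket=bucket, Key=key)

# mode='r|' reads the outer tar strictly sequentially, so the whole
# archive is never pulled into memory or onto disk.
with tarfile.open(fileobj=obj['Body'], mode='r|') as outside:
    for item in outside:
        inner = outside.extractfile(item)
        if inner is None:
            continue  # skip directories and special entries
        # mode='r|gz' streams the nested tar.gz without seeking
        with tarfile.open(fileobj=inner, mode='r|gz') as inside_tar:
            for inner_item in inside_tar:
                data = inside_tar.extractfile(inner_item)
                if data is not None:
                    print(inner_item.name, len(data.read()))
The important detail is the pipe modes ('r|', 'r|gz') at every level: they tell tarfile not to seek, which is what makes a non-seekable StreamingBody workable, but it also means each member must be consumed in order before the loop advances.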

Related

How can I save all the generated images from the following code to a Zip file using python?

I need to decode the data in the content_arrays list and generate an image. The following code does that:
import base64

content_arrays = ['ljfdslkfjaslkfjsdlf', 'sdfasfsdfsdfsafs']  # a list of base64 encoded data
i = 0
for content in content_arrays:
    img_data = content_arrays[i]
    with open(filename, "wb") as fh:  # filename is set elsewhere, one per image
        fh.write(base64.b64decode(img_data))
    i = i + 1
How can I store all of the generated images directly in a single zip file, containing every image produced by decoding the base64 strings from the content_arrays list above?
Current file structure of the downloaded data:
-- Desktop
   -- image1.png
   -- image2.png
Required file structure of the downloaded data:
-- Desktop
   -- Data.zip
      -- image1.png
      -- image2.png
I've tried the Python zipfile module but couldn't figure it out. If there is a way to do this, please share your suggestions.
You can just use the zipfile module and write the content to separate files inside the zip. In this example I write the content to one file inside the zip for each item in the contents list. I'm also using the writestr method so I don't need physical files on disk: I can create the content in memory and write it straight into the zip, rather than first creating a file on the OS and then adding that file to the zip.
from zipfile import ZipFile

with ZipFile("data.zip", "w") as my_zip:
    content_arrays = ['ljfdslkfjaslkfjsdlf', 'sdfasfsdfsdfsafs']
    for index, content in enumerate(content_arrays):
        # do whatever you need to do here with your content
        my_zip.writestr(f'file_{index}.txt', content)
In your case, you can iterate over the list of filenames
with ZipFile('images.zip', 'w') as zip_obj:
    # Add multiple files to the zip
    for filename in filenames:
        zip_obj.write(filename)
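Putting the two together for the original question, here is a minimal sketch that decodes each base64 string and writes the bytes straight into the archive with writestr, so nothing is written to Desktop first (the image names are assumptions):
import base64
from zipfile import ZipFile

content_arrays = ['aGVsbG8=', 'd29ybGQ=']  # placeholder base64 strings

with ZipFile('Data.zip', 'w') as my_zip:
    for index, content in enumerate(content_arrays):
        # decode in memory and write the bytes directly into the zip
        my_zip.writestr(f'image{index + 1}.png', base64.b64decode(content))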

Unzip a file and save its content into a database

I am building a website using Django where the user can upload a .zip file. I do not know how many sub folders the file has or which type of files it contains.
I want to:
1) Unzip the file
2) Get all the files in the unzipped directory (which might contain nested sub folders)
3) Save these files (the content, not the path) into the database.
I managed to unzip the file and to output the files path.
However, this is not exactly what I want, because I do not care about the file path, only the file itself.
In addition, since I am saving the unzipped files into media/documents, if different users upload different zips and they are all unzipped there, the media/documents folder would become huge and it would be impossible to tell who uploaded what.
Unzipping the .zip file
myFile = request.FILES.get('my_uploads')
with ZipFile(myFile, 'r') as zipObj:
    zipObj.extractall('media/documents/')
Getting path of file in subfolders
x = [i[2] for i in os.walk('media/documents/')]
file_names = []
for t in x:
    for f in t:
        file_names.append(f)
views.py (it is not perfect, just an idea; I am still debugging):
def homeupload(request):
    if request.method == "POST":
        my_entity = Uploading()
        # my_entity.my_uploads = request.FILES["my_uploads"]
        myFile = request.FILES.get('my_uploads')
        with ZipFile(myFile, 'r') as zipObj:
            zipObj.extractall('media/documents/')
        x = [i[2] for i in os.walk('media/documents/')]
        file_names = []
        for t in x:
            for f in t:
                file_names.append(f)
        my_entity.save()
You really don't have to clutter up your filesystem when using a ZipFile: it has methods that let you read the files stored in the zip directly into memory, and then you can save those objects to a database.
Specifically, we can use .infolist() or .namelist() to get a list of all the files in the zip, and .read() to actually get their contents:
with ZipFile(myFile, 'r') as zipObj:
    file_objects = [zipObj.read(item) for item in zipObj.namelist()]
Now file_objects is a list of bytes objects holding the content of all the files. I didn't bother saving the names or file paths because you said that was unnecessary, but that can be done too. To see what else is available, check out what actually gets returned from infolist().
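For reference, infolist() yields ZipInfo objects; a quick way to see the per-entry metadata they carry:
with ZipFile(myFile, 'r') as zipObj:
    for info in zipObj.infolist():
        # filename, uncompressed size, and modification time per entry
        print(info.filename, info.file_size, info.date_time)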
Saving these bytes objects to your database is usually possible if the database supports it (most do). If you instead want the files as plain text rather than bytes, just convert them first with something like .decode():
with ZipFile(myFile, 'r') as zipObj:
    file_objects = [zipObj.read(item).decode() for item in zipObj.namelist()]
Notice that we never saved any files to the filesystem, so there is no pile of user-uploaded files cluttering up your system. The database's storage size on disk will, however, grow accordingly.
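To tie this back to the Django view, here is a minimal sketch of persisting the bytes, assuming a hypothetical Document model with a BinaryField (the model and field names are made up for illustration):
from zipfile import ZipFile
from django.db import models

class Document(models.Model):
    # Hypothetical model: file name plus the raw bytes
    name = models.CharField(max_length=255)
    content = models.BinaryField()

def save_zip_contents(myFile):
    """Store every regular member of the uploaded zip in the database."""
    with ZipFile(myFile, 'r') as zipObj:
        for item in zipObj.namelist():
            if item.endswith('/'):
                continue  # skip directory entries
            Document.objects.create(name=item, content=zipObj.read(item))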

is it possible to collect comment data form multiple zip files without unzipping?

Is it possible to collect the comment data of a zip file, from multiple files? (That is, the optional comment you see in the side panel when opening a zip or RAR file.)
And if so, where exactly does the comment get stored?
You can do something like:
from zipfile import ZipFile

zipfiles = ["example.zip", ]
for zfile in zipfiles:
    print("Opening: {}".format(zfile))
    with ZipFile(zfile, 'r') as testzip:
        print(testzip.comment)  # comment for the entire zip
        l = testzip.infolist()  # list all files in the archive
        for finfo in l:
            # per file/directory comments
            print("{}:{}".format(finfo.filename, finfo.comment))
Check http://www.artpol-software.com/ZipArchive/KB/0610242300.aspx for more information on how and where metadata is stored in zip files.
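As for where the comment lives: the archive-level comment is stored in the End of Central Directory record at the very end of the file, and per-entry comments sit in each entry's central directory header, which is why both can be read without decompressing anything. Writing one is equally direct; a small sketch:
from zipfile import ZipFile

# ZipFile.comment is a bytes attribute; assigning it on an archive
# opened in 'w' or 'a' mode persists the comment when the file closes.
with ZipFile('example.zip', 'a') as zf:
    zf.comment = b'archive-level comment'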

Why are my `binaryFiles` empty when I collect them in pyspark?

I have two zip files on hdfs in the same folder : /user/path-to-folder-with-zips/.
I pass that to binaryFiles in pyspark:
zips = sc.binaryFiles('/user/path-to-folder-with-zips/')
I'm trying to unzip the zip files and do things to the text files in them, so I tried to just see what the content will be when I try to deal with the RDD. I did it like this:
zips_collected = zips.collect()
But, when I do that, it gives an empty list:
>> zips_collected
[]
I know that the zips are not empty - they have textfiles. The documentation here says
Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
What am I doing wrong here? I know I can't view the contents of the file because it is zipped and therefore binary. But, I should at least be able to see SOMETHING. Why does it not return anything?
There can be more than one file per zip file, but the contents are always something like this:
rownum|data|data|data|data|data
rownum|data|data|data|data|data
rownum|data|data|data|data|data
I'm assuming that each zip file contains a single text file (code is easily changed for multiple text files). You need to read the contents of the zip file first via io.BytesIO before processing line by line. Solution is loosely based on https://stackoverflow.com/a/36511190/234233.
import io
import gzip

def zip_extract(x):
    """Extract a gzip-compressed file in memory for Spark."""
    file_obj = gzip.GzipFile(fileobj=io.BytesIO(x[1]), mode="r")
    return file_obj.read()

zip_data = sc.binaryFiles('/user/path-to-folder-with-zips/*.zip')
results = zip_data.map(zip_extract) \
    .flatMap(lambda zip_file: zip_file.split("\n")) \
    .map(lambda line: parse_line(line)) \
    .collect()
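One caveat: GzipFile only understands gzip streams. If the archives are genuine .zip files, a zipfile-based extractor can be swapped in instead; a sketch, assuming one text member per archive as the question describes:
import io
import zipfile

def zip_extract(x):
    """Extract the single text member of an in-memory .zip archive."""
    # x is a (path, bytes) pair as produced by sc.binaryFiles
    with zipfile.ZipFile(io.BytesIO(x[1]), 'r') as zf:
        return zf.read(zf.namelist()[0]).decode('utf-8')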

Untar file to in memory data structure

I have a .tar file which contains other tar files and some simple text files. Ideally I would like to read the entire tar file, including the sub .tar files, into an in-memory data structure for further manipulation. I'm looking for the most efficient way to handle this. The following provides a list of the files in the first level of the tar, but I need to detect the sub .tar files and then untar them.
tar = tarfile.open("test.tar")
# print(tar.getmembers())
# filenames = tar.getnames()
for file in tar:
    print(file.name)
I've tried using the is_tarfile() method to check but that seems to need a filename.
To get you further, here's a recursive routine that extracts each member to a file-like object and tries to open that object as a tar in turn:
import tarfile

def unpack(filename, fileobj=None):
    tar = tarfile.open(filename, fileobj=fileobj)
    for file in tar.getmembers():
        print(file.name)
        contentfobj = tar.extractfile(file)
        if contentfobj is None:
            continue  # directories have no file object
        try:
            # maybe this member is itself a tar
            unpack(None, fileobj=contentfobj)
        except tarfile.ReadError:
            # print ("not a tar")
            print(contentfobj.read())

unpack("test.tar")
unpack takes a filename first time, then a fileobj provided by .extractfile() on each of the members. The last print shows how you can get the contents of the file if it is not a tar.
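Since the goal is an in-memory data structure rather than printed output, the same recursion can return a nested dict mapping member names to bytes, or to a sub-dict for inner tars. A sketch built on the routine above, under the same assumptions:
import tarfile

def unpack_to_dict(filename, fileobj=None):
    """Return {member name: bytes, or a nested dict for inner tars}."""
    result = {}
    tar = tarfile.open(filename, fileobj=fileobj)
    for member in tar.getmembers():
        contentfobj = tar.extractfile(member)
        if contentfobj is None:
            continue  # skip directories and special entries
        try:
            result[member.name] = unpack_to_dict(None, fileobj=contentfobj)
        except tarfile.ReadError:
            result[member.name] = contentfobj.read()
    return result

tree = unpack_to_dict("test.tar")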
