I have two zip files on HDFS in the same folder: /user/path-to-folder-with-zips/.
I pass that path to binaryFiles in PySpark:
zips = sc.binaryFiles('/user/path-to-folder-with-zips/')
I'm trying to unzip the zip files and do things to the text files inside them, so as a first step I tried to inspect what the RDD contains. I did it like this:
zips_collected = zips.collect()
But when I do that, it gives an empty list:
>>> zips_collected
[]
I know that the zips are not empty; they contain text files. The documentation for binaryFiles says:
Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
What am I doing wrong here? I know I can't view the contents of the files directly, because they are zipped and therefore binary, but I should at least be able to see SOMETHING. Why does it not return anything?
There can be more than one file per zip file, but the contents are always something like this:
rownum|data|data|data|data|data
rownum|data|data|data|data|data
rownum|data|data|data|data|data
I'm assuming that each zip file contains a single text file (the code is easily changed to handle multiple text files). You need to read the contents of the zip file into memory via io.BytesIO before processing it line by line. The solution is loosely based on https://stackoverflow.com/a/36511190/234233, which targets gzip; for zip archives, zipfile is used here instead.
import io
import zipfile

def zip_extract(x):
    """Extract the single text file inside an in-memory zip for Spark."""
    # x is a (path, bytes) pair from binaryFiles; BytesIO makes the raw
    # bytes look like a seekable file that ZipFile can open.
    with zipfile.ZipFile(io.BytesIO(x[1]), 'r') as zf:
        return zf.read(zf.namelist()[0]).decode('utf-8')

zip_data = sc.binaryFiles('/user/path-to-folder-with-zips/*.zip')
results = (zip_data.map(zip_extract)
           .flatMap(lambda content: content.split("\n"))
           .map(parse_line)  # parse_line is your own row parser, e.g. line.split("|")
           .collect())
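If a zip can hold more than one text file, a minimal variant (a sketch; parse_line is still assumed to be your own parser) can emit one record per inner file and flatten from there:
def zip_extract_all(x):
    """Yield the decoded contents of every file inside an in-memory zip."""
    with zipfile.ZipFile(io.BytesIO(x[1]), 'r') as zf:
        for name in zf.namelist():
            yield zf.read(name).decode('utf-8')

results = (zip_data.flatMap(zip_extract_all)
           .flatMap(lambda content: content.split("\n"))
           .map(parse_line)
           .collect())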
I've got a large tar file in s3 (10s of GBs). It contains a number of tar.gz files.
I can loop through the contents of the large file with something like
s3_client = boto3.client('s3')
input = s3_client.get_object(Bucket=bucket, Key=key)
with tarfile.open(fileobj=input['Body'], mode='r|') as tar:
    print(tar)  # tarinfo
However I can't seem to open the file contents from the inner tar.gz file.
I want to be able to do this in a streaming manner rather than load the whole file into memory.
I've tried doing things like
tar.extractfile(tar.next())
But I'm not sure how this file-like object is then readable.
--- EDIT
I've got slightly further with the help of @larsks.
with tarfile.open(fileobj=input_tar_file['Body'], mode='r|') as tar:
    for item in tar:
        m = tar.extractfile(item)
        if m is not None:
            with tarfile.open(fileobj=m, mode='r|gz') as gz:
                for data in gz:
                    d = gz.extractfile(data)
However, if I call .read() on d, it is empty. If I dig into d.raw.fileobj.read() there is data, but when I write that out it is the data from all the text files in the nested tar.gz rather than one file at a time.
The return value of tar.extractfile is a "file-like object", just like input['Body']. That means you can simply pass that to tarfile.open. Here's a simple example that prints the contents of a nested archive:
import tarfile

with open('outside.tar', 'rb') as fd:
    with tarfile.open(fileobj=fd, mode='r') as outside:
        for item in outside:
            with outside.extractfile(item) as inside:
                with tarfile.open(fileobj=inside, mode='r') as inside_tar:
                    for item in inside_tar:
                        data = inside_tar.extractfile(item)
                        print('content:', data.read())
Here the "outside" file is an actual file rather than something coming from an S3 bucket, but I'm opening it first so that we're still passing in fileobj when opening the outside archive.
The code iterates through the contents of the outside archive (for item in outside), and for each of these items:
1) Opens the file using outside.extractfile()
2) Passes that as the fileobj argument to tarfile.open
3) Extracts each item inside the nested tarfile
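To map this back to the S3 case, the same pattern should work against the streaming body, with one caveat: with mode='r|' the archive is a forward-only stream, so each member has to be read immediately, before the iterator advances past it (which is one likely reason the earlier .read() on d came back empty). A sketch, assuming boto3 and the bucket/key variables from the question:
import tarfile
import boto3

s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket=bucket, Key=key)

with tarfile.open(fileobj=obj['Body'], mode='r|') as outside:
    for item in outside:
        if not item.isfile():
            continue
        inside = outside.extractfile(item)
        # open the member as a nested gzipped tar, still streaming
        with tarfile.open(fileobj=inside, mode='r|gz') as inside_tar:
            for inner in inside_tar:
                if inner.isfile():
                    # consume each member while it is the current one in the stream
                    data = inside_tar.extractfile(inner).read()
                    print(inner.name, len(data))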
I am building a website using Django where the user can upload a .zip file. I do not know how many sub folders the file has or which type of files it contains.
I want to:
1) Unzip the file
2) Get all the file in the unzipped directory (which might contains nested sub folders)
3) Save these files (the content, not the path) into the database.
I managed to unzip the file and to output the files path.
However, this is not exactly what I want, because I do not care about the file paths but about the files themselves.
In addition, since I am saving the unzipped files into media/documents, if different users upload different zips and all of them are unzipped, the media/documents folder would become huge and it would be impossible to know who uploaded what.
Unzipping the .zip file
myFile = request.FILES.get('my_uploads')
with ZipFile(myFile, 'r') as zipObj:
    zipObj.extractall('media/documents/')
Getting path of file in subfolders
x = [i[2] for i in os.walk('media/documents/')]
file_names = []
for t in x:
    for f in t:
        file_names.append(f)
views.py # It is not perfect, it is just an idea. I am just debugging.
def homeupload(request):
    if request.method == "POST":
        my_entity = Uploading()
        # my_entity.my_uploads = request.FILES["my_uploads"]
        myFile = request.FILES.get('my_uploads')
        with ZipFile(myFile, 'r') as zipObj:
            zipObj.extractall('media/documents/')
        x = [i[2] for i in os.walk('media/documents/')]
        file_names = []
        for t in x:
            for f in t:
                file_names.append(f)
        my_entity.save()
You really don't have to clutter up your filesystem when using a ZipFile, as it contains methods that allow you to read the files stored in the zip directly into memory, and then you can save those objects to a database.
Specifically, we can use .infolist() or .namelist() to get a list of all the files in the zip, and .read() to actually get their contents:
with ZipFile(myFile, 'r') as zipObj:
    file_objects = [zipObj.read(item) for item in zipObj.namelist()]
Now file_objects is a list of bytes objects holding the content of all the files. I didn't bother saving the names or file paths because you said it was unnecessary, but that can be done too. To see what you can do, check out what actually gets returned from infolist.
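For instance, a minimal sketch that keeps the archive paths as dictionary keys (the dict shape is just an assumption; adjust it to whatever your model needs):
with ZipFile(myFile, 'r') as zipObj:
    # map each archived path to its raw bytes
    files_by_name = {name: zipObj.read(name) for name in zipObj.namelist()}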
If you want to save these bytes objects to your database, that's usually possible if your database supports it (most do). If, however, you want these files as plain text rather than bytes, you just have to convert them first with something like .decode:
with ZipFile(myFile, 'r') as zipObj:
    file_objects = [zipObj.read(item).decode() for item in zipObj.namelist()]
Notice that we never saved any files to the filesystem, so there is no need to worry about user uploads cluttering it up. The database storage on your disk will, however, grow accordingly.
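As a rough illustration of the database step, assuming a hypothetical Document model with a BinaryField (the model and field names are made up for this sketch):
# models.py (hypothetical)
from django.db import models

class Document(models.Model):
    name = models.CharField(max_length=255)
    content = models.BinaryField()

# in the view, after receiving the upload
with ZipFile(myFile, 'r') as zipObj:
    for name in zipObj.namelist():
        Document.objects.create(name=name, content=zipObj.read(name))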
I am using python zipfile built-in module.
I am able to unzip a file but I need to exclude just one file.
Is there a way I can do that?
Since I am using extractall() I am getting the excluded file too.
with ZipFile(zip_file_name, 'r') as zipObj:
    # Extract all the contents of the zip file into the current directory
    zipObj.extractall()
To do this, I think you need two steps:
1) List the archived file names
2) Add an if condition (or a regex) that skips the file you want to exclude
with ZipFile(zip_file_name, 'r') as zipObj:
    # Get a list of all archived file names from the zip
    listOfFileNames = zipObj.namelist()
    # Iterate over the file names
    for fileName in listOfFileNames:
        # Check the excluding-file condition; use == (not `is`) to compare strings
        if fileName == 'FILE_TO_BE_EXCLUDED.txt':
            continue
        zipObj.extract(fileName, 'path_for_extracting')
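Alternatively, ZipFile.extractall() accepts a members argument, so the same exclusion fits in a single call (a minimal sketch):
with ZipFile(zip_file_name, 'r') as zipObj:
    wanted = [n for n in zipObj.namelist() if n != 'FILE_TO_BE_EXCLUDED.txt']
    zipObj.extractall('path_for_extracting', members=wanted)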
My reference was here.
Hope this can be helpful for you.
I am trying to grab a single file from a tar archive. I am using the tarfile library, and I can do things like find the files with the right extension in a list, as in the library's example:
def xml_member_files(self, members):
    for tarinfo in members:
        if os.path.splitext(tarinfo.name)[1] == ".xml":
            yield tarinfo

member_file = self.xml_member_files(tar)
for m in member_file:
    print(m.name)
This is great and the output is:
RS2_C0RS2_OK67683_PK618800_DK549742_SLA23_20151006_234046_HH_SLC/lutBeta.xml
RS2_C0RS2_OK67683_PK618800_DK549742_SLA23_20151006_234046_HH_SLC/lutGamma.xml
RS2_C0RS2_OK67683_PK618800_DK549742_SLA23_20151006_234046_HH_SLC/lutSigma.xml
RS2_C0RS2_OK67683_PK618800_DK549742_SLA23_20151006_234046_HH_SLC/product.xml
If I tell it to look just for product.xml, it doesn't work. So I tried this:
ti = tar.getmember('product.xml')
print(ti.name)
and it doesn't find product.xml, because getmember() needs the full path inside the archive, which I would have to guess beforehand. I have no idea how to retrieve just that path information so I can get at my product.xml file once extracted (it feels like I am doing things the hard way anyway). How do I figure out just that path, so I can pass it to my other file functions to read and load that XML file after it is the only file extracted from the tar?
Return the full path by iterating over the result of getnames(). For example, to get the full path of lutBeta.xml:
tar = tarfile.TarFile('mytarfile.tar')
membername = [x for x in tar.getnames() if os.path.basename(x) == 'lutBeta.xml'][0]
I would try first doing TarFile.getnames(), which I imagine works a lot like tar tzf filename.tar.gz from the command line. Then you can find out which paths to feed to your getmember() or getmembers().
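Putting that together for the product.xml case from the question (a sketch that assumes exactly one match in the archive):
import os
import tarfile

tar = tarfile.open('mytarfile.tar')
# find the archive path whose basename is product.xml
product_path = next(n for n in tar.getnames() if os.path.basename(n) == 'product.xml')
ti = tar.getmember(product_path)
print(ti.name)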
You don't want to be iterating over the entire tar with getnames(), getmember() or getmembers(), because as soon as you find your file, you don't need to keep looking through the rest of the tar.
For example, it takes my machine about 47 ms to extract a single file from a 2 GB tar by iterating over all the file names:
with tarfile.open('/tmp/2GB-file.tar', mode='r:') as tar:
    membername = [x for x in tar.getnames() if x.endswith('myfile.txt')][0]
    file = tar.extractfile(membername).read().decode()
But stopping as soon as the file is found takes me only 0.27 ms, nearly 175x faster.
file = None
with tarfile.open('/tmp/2GB-file.tar', mode='r:') as tar:
    for member in tar:
        if member.name.endswith('myfile.txt'):
            file = tar.extractfile(member).read().decode()
            break
Note that if the file you need is near the end of the archive, you probably won't notice much of a speed difference, but it is still good practice not to loop through the whole archive if you don't have to.
I have a .tar file which contains other tar files and some plain text files. Ideally I would like to read the entire tar file, including the sub .tar files, into an in-memory data structure for further manipulation. I'm looking for the most efficient way to handle this. The following lists the files at the first level of the tar, but I need to detect the sub .tar files and then untar them as well.
tar = tarfile.open("test.tar")
#print tar.getmembers()
#filenames = tar.getnames()
for file in tar:
    print(file.name)
I've tried using tarfile.is_tarfile() to check, but that seems to need a filename.
To get you further, here's a recursive routine that extracts each member in memory and tries to unpack it as a tar in turn:
import tarfile

def unpack(filename, fileobj=None):
    tar = tarfile.open(filename, fileobj=fileobj)
    for file in tar.getmembers():
        print(file.name)
        contentfobj = tar.extractfile(file)
        if contentfobj is None:  # directories have no file object
            continue
        try:
            # recurse: treat the member as a nested tar
            unpack(None, fileobj=contentfobj)
        except tarfile.ReadError:
            # not a tar, so just print the raw contents
            print(contentfobj.read())

unpack("test.tar")
unpack takes a filename the first time, then a fileobj provided by .extractfile() for each of the members. The last print shows how you can get the contents of a member that is not a tar.
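Since the goal is an in-memory data structure, a small variant of the same routine (a sketch; the nested-dict shape is just one reasonable choice) can return the contents instead of printing them:
import tarfile

def unpack_to_dict(filename, fileobj=None):
    """Return {member_name: bytes or nested dict} for a (possibly nested) tar."""
    result = {}
    with tarfile.open(filename, fileobj=fileobj) as tar:
        for member in tar.getmembers():
            contentfobj = tar.extractfile(member)
            if contentfobj is None:  # skip directories
                continue
            try:
                # nested tars become nested dicts
                result[member.name] = unpack_to_dict(None, fileobj=contentfobj)
            except tarfile.ReadError:
                result[member.name] = contentfobj.read()
    return result

tree = unpack_to_dict("test.tar")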