Using Databricks/Python3.x ZipFile to extract 7gb file from zip - python

I've got a large NPI zip file which includes a 7.3 GB CSV. (The file can be found on the NPI site here: http://download.cms.gov/nppes/NPI_Files.html -- the Full Replacement Monthly NPI File.)
When using extractall, every file is extracted to the proper location and all files are correct, with the exception of that 7 GB file. It only extracts to 108.9 KB.
Here's the code...
import zipfile

with zipfile.ZipFile(sourcePath, mode='r') as zip_ref:
    zip_ref.extractall(destinationPath)
I even added allowZip64=True just in case, but it still only extracts the file to 108 KB.
Any idea what I could be doing wrong here?
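One way to narrow this down (not from the thread itself) is to stream just the large member by hand and compare the bytes written against the size recorded in the archive header; a minimal sketch, assuming sourcePath and destinationPath from the question:
import os
import shutil
import zipfile

with zipfile.ZipFile(sourcePath, mode='r') as zip_ref:
    # Pick the largest member -- assumed here to be the 7.3 GB CSV.
    info = max(zip_ref.infolist(), key=lambda i: i.file_size)
    target = os.path.join(destinationPath, os.path.basename(info.filename))
    # Stream the member through a fixed-size buffer instead of extractall().
    with zip_ref.open(info) as src, open(target, 'wb') as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)
    # Compare the bytes written against what the archive header claims.
    print(info.filename, info.file_size, os.path.getsize(target))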

Related

Compressing Excel file does not reduce the size

I am using Python to automate the conversion of an Excel file to a zip archive. The file is zipped using the code below.
import os
from zipfile import ZipFile
path = r'C:/Users/kj/Desktop/Results'
os.chdir(path)
new_path = "C:/Users/kj/Desktop/New folder/"
with ZipFile(new_path + "Week6.zip", 'w') as zf:
    zf.write("Week6.csv")
The problem is that the code above creates the zip file but does not reduce its size; it stays the same. But when I zip the file manually, the size drops considerably. What do I need to change to get the same result in code?
That's because the default compression algorithm is ZIP_STORED, which stores the files uncompressed. You have to specify a compression algorithm, for example LZMA:
from zipfile import ZipFile, ZIP_LZMA

ZipFile(new_path + "Week6.zip", mode='w', compression=ZIP_LZMA).write("Week6.csv")
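For a self-contained sketch (file names are illustrative): ZIP_DEFLATED is the most widely supported choice and is fast, while ZIP_LZMA usually produces smaller archives at the cost of speed.
from zipfile import ZipFile, ZIP_DEFLATED, ZIP_LZMA

# Deflate: the most widely supported choice, and fast.
with ZipFile("Week6_deflate.zip", mode='w', compression=ZIP_DEFLATED) as zf:
    zf.write("Week6.csv")

# LZMA: usually smaller archives, but slower and not readable by every tool.
with ZipFile("Week6_lzma.zip", mode='w', compression=ZIP_LZMA) as zf:
    zf.write("Week6.csv")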

Limit on bz2 file decompression using python?

I have numerous files that are compressed in the bz2 format and I am trying to decompress them into a temporary directory using Python and then analyze them. There are hundreds of thousands of files, so manually decompressing them isn't feasible; I wrote the following script instead.
My issue is that whenever I try this, the maximum file size is 900 KB, even though manual decompression makes each file around 6 MB. I am not sure whether this is a flaw in my code, in how I am saving the data as a string and copying it to the file, or a problem with something else. I have tried this with different files and I know that it works for files smaller than 900 KB. Has anyone else had a similar problem and knows of a solution?
My code is below:
import numpy as np
import bz2
import os
import glob

def unzip_f(filepath):
    '''
    Input: a filepath specifying a group of Himawari .bz2 files with common names.
    Outputs the paths of all the temporary files that have been uncompressed.
    '''
    cpath = os.getcwd()  # get current path
    filenames_ = []  # list to add filenames to for future use
    for zipped_file in glob.glob(filepath):  # loop over the files that meet the name criteria
        with bz2.BZ2File(zipped_file, 'rb') as zipfile:  # read in the bz2 files
            newfilepath = cpath + '/temp/' + zipped_file[-47:-4]  # path of the temporary file
            with open(newfilepath, "wb") as tmpfile:  # open the temporary file
                for i, line in enumerate(zipfile.readlines()):
                    tmpfile.write(line)  # write the data from the compressed file to the temporary file
        filenames_.append(newfilepath)
    return filenames_

path_ = 'test/HS_H08_20180930_0710_B13_FLDK_R20_S*bz2'
unzip_f(path_)
It returns the correct file paths, but with the wrong sizes, capped at 900 KB.
It turns out this issue is due to the files being multi-stream, which Python 2.7's bz2 module does not support, as jasonharper pointed out. Below is a solution that just uses the Unix bzip2 command to decompress the files and then moves them to the temporary directory I want. It is not as pretty, but it works.
import numpy as np
import os
import glob
import shutil

def unzip_f(filepath):
    '''
    Input: a filepath specifying a group of Himawari .bz2 files with common names.
    Outputs the paths of all the temporary files that have been uncompressed.
    '''
    cpath = os.getcwd()  # get current path
    filenames_ = []  # list to add filenames to for future use
    for zipped_file in glob.glob(filepath):  # loop over the files that meet the name criteria
        newfilepath = cpath + '/temp/'  # temporary directory for the output
        newfilename = newfilepath + zipped_file[-47:-4]
        os.popen('bzip2 -kd ' + zipped_file).read()  # .read() blocks until bzip2 has finished
        shutil.move(zipped_file[-47:-4], newfilepath)
        filenames_.append(newfilename)
    return filenames_

path_ = 'test/HS_H08_20180930_0710_B13_FLDK_R20_S0*bz2'
unzip_f(path_)
This is a known limitation in Python 2, where the BZ2File class doesn't support multiple streams.
It can easily be resolved by using bz2file (https://pypi.org/project/bz2file/), which is a backport of the Python 3 implementation and can be used as a drop-in replacement.
After running pip install bz2file you can just replace bz2 with it:
import bz2file as bz2, and everything should just work :)
The original Python bug report: https://bugs.python.org/issue1625
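A minimal sketch of the drop-in replacement (the file name is illustrative):
import bz2file as bz2  # pip install bz2file

# bz2file's BZ2File reads all streams in a multi-stream archive,
# even on Python 2.7 where the stdlib stops after the first one.
with bz2.BZ2File('HS_H08_20180930_0710_B13_FLDK_R20_S0110.DAT.bz2', 'rb') as f:
    data = f.read()
print(len(data))  # full decompressed size, not capped at 900 KB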

Determine Filename of Unzipped File

Say you unzip a file called file123.zip with zipfile.ZipFile, which yields an unzipped file saved to a known path. However, this unzipped file has a completely random name. How do you determine this completely random filename? Or is there some way to control what the name of the unzipped file is?
I am trying to implement this in Python.
By "random" I assume that you mean that the files are named arbitrarily.
You can use ZipFile.read() which unzips the file and returns its contents as a string of bytes. You can then write that string to a named file of your choice.
from zipfile import ZipFile

with ZipFile('file123.zip') as zf:
    for i, name in enumerate(zf.namelist()):
        with open('outfile_{}'.format(i), 'wb') as f:
            f.write(zf.read(name))
This will write each file from the archive to a file named outfile_n in the current directory. The names of the files contained in the archive are obtained with ZipFile.namelist(). I've used enumerate() as a simple method of generating the file names; however, you could substitute that with whatever naming scheme you require.
If the filename is completely random, you can compare the contents of the output directory with os.listdir() before and after extraction; the difference is the name of the new file. Now you know the filename and can do whatever you want with it :)
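A minimal sketch of that directory-diff approach (paths are illustrative):
import os
import zipfile

out_dir = 'extracted'
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)
before = set(os.listdir(out_dir))
with zipfile.ZipFile('file123.zip') as zf:
    zf.extractall(out_dir)
new_files = set(os.listdir(out_dir)) - before
print(new_files)  # whatever names the archive actually produced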

Unzip folder by chunks in python

I have a big zip file containing many files that I'd like to unzip in chunks to avoid consuming too much memory.
I tried to use the Python module zipfile, but I didn't find a way to load the archive in chunks and extract it to disk.
Is there a simple way to do that in Python?
EDIT
@steven-rumbalski correctly pointed out that zipfile already handles big files correctly, unzipping them one by one without loading the full archive.
My problem here is that my zip file is on AWS S3 and my EC2 instance cannot load such a big file into RAM, so I download it in chunks and would like to unzip it in chunks as well.
You don't need a special way to extract a large archive to disk. The source Lib/zipfile.py shows that zipfile is already memory efficient. Creating a zipfile.ZipFile object does not read the whole file into memory. Rather, it just reads the table of contents for the ZIP file. ZipFile.extractall() then extracts the members one at a time using shutil.copyfileobj(), copying from a subclass of io.BufferedIOBase.
If all you want to do is a one-time extraction Python provides a shortcut from the command line:
python -m zipfile -e archive.zip target-dir/
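To make the streaming behaviour concrete, here is a minimal hand-rolled equivalent (archive and target names are illustrative); each member is copied through a small fixed-size buffer, so memory use stays flat regardless of member size:
import os
import shutil
import zipfile

with zipfile.ZipFile('archive.zip') as zf:
    for info in zf.infolist():
        if info.filename.endswith('/'):  # skip directory entries
            continue
        target = os.path.join('target-dir', info.filename)
        if not os.path.isdir(os.path.dirname(target)):
            os.makedirs(os.path.dirname(target))
        # copyfileobj copies through a small buffer, so the member
        # never has to fit in memory.
        with zf.open(info) as src, open(target, 'wb') as dst:
            shutil.copyfileobj(src, dst)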
You can use zipfile (or possibly tarfile) as follows:
import zipfile

def extract_chunk(fn, directory, ix_begin, ix_end):
    # Extract the members whose indices fall in [ix_begin, ix_end).
    with zipfile.ZipFile(fn, 'r') as zf:
        infos = zf.infolist()
        for ix in range(max(0, ix_begin), min(ix_end, len(infos))):
            zf.extract(infos[ix], directory)

directory = "path"
extract_chunk("{}/file.zip".format(directory), directory, 0, 50)
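To extract the whole archive in batches, as the question asks, the helper can then be called with successive index windows; a small sketch continuing from the snippet above (chunk_size is illustrative):
chunk_size = 50
with zipfile.ZipFile("{}/file.zip".format(directory), 'r') as zf:
    n_members = len(zf.infolist())
for start in range(0, n_members, chunk_size):
    extract_chunk("{}/file.zip".format(directory), directory, start, start + chunk_size)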

How to get a list of files from the archive (rar or zip) attached to the E-mail using Python?

How do you get a list of the files inside an archive (rar or zip) attached to an e-mail, using Python? That is, I have an EML file, and I do not need to unzip the attachment, just get a list of its contents. An attachment can be very large, so extracting it and processing the extracted files could take a lot of time and resources.
Here's how to do that with the stdlib, getting the first attachment in a simple multi-part message stored as message.eml:
import email.parser
import StringIO
import zipfile

with open('message.eml') as f:
    msg = email.parser.Parser().parse(f)

attachment = msg.get_payload(1)  # the second MIME part
zipf = StringIO.StringIO(attachment.get_payload(decode=True))  # decode=True undoes the base64 transfer encoding
zip = zipfile.ZipFile(zipf)
filenames = zip.namelist()
This will parse the whole MIME envelope, decode the whole attachment, and read the ZIP directory of that attachment… but at least it won't uncompress any of the files in the ZIP archive, so I suspect you won't actually have any performance problem to worry about.
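The snippet above targets Python 2 (it uses StringIO); for reference, a minimal Python 3 equivalent, assuming the same two-part message.eml:
import email
import io
import zipfile

with open('message.eml', 'rb') as f:
    msg = email.message_from_binary_file(f)

attachment = msg.get_payload(1)                # the second MIME part
payload = attachment.get_payload(decode=True)  # undo the base64 transfer encoding
print(zipfile.ZipFile(io.BytesIO(payload)).namelist())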
To get the file object for a zip archive, use the ZipFile constructor to open the file rather than the normal open() function. Then you can use ZipFile.namelist() to get the names of the archive members.
