I'm trying to extract data from a zip file in Python, but it's kind of slow. Could anyone advise me and see if I'm doing something that obviously makes it slower?
from zipfile import ZipFile

def go_through_zip(zipname):
    out = {}
    with ZipFile(zipname) as z:
        for filename in z.namelist():
            with z.open(filename) as f:
                try:
                    outdict = make_dict(f)
                    out.update(outdict)
                except Exception:
                    print("File is not in the correct format:", filename)
    return out
make_dict(f) just takes the open file and builds a dictionary from it. That function is probably also slow, but that's not what I want to speed up right now.
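For what it's worth, here is a rough sketch of how I could time the decompression separately from make_dict, to confirm where the time actually goes (make_dict is my own function, assumed to accept a file object):

import io
import time
from zipfile import ZipFile

def profile_zip(zipname):
    # split the cost into "decompress" vs. "make_dict" so we know what to optimize
    decompress_s = parse_s = 0.0
    with ZipFile(zipname) as z:
        for filename in z.namelist():
            t0 = time.perf_counter()
            data = z.read(filename)        # forces full decompression
            t1 = time.perf_counter()
            make_dict(io.BytesIO(data))    # parse from an in-memory buffer
            t2 = time.perf_counter()
            decompress_s += t1 - t0
            parse_s += t2 - t1
    print(f"decompress: {decompress_s:.2f}s  make_dict: {parse_s:.2f}s")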
Try using the following code for file extraction. It works fast as long as the size of the file being extracted is reasonable.
# importing required modules
from zipfile import ZipFile

# specifying the zip file name
file_name = "my_python_files.zip"

# opening the zip file in READ mode
with ZipFile(file_name, 'r') as zf:
    # printing all the contents of the zip file
    zf.printdir()

    # extracting all the files
    print('Extracting all the files now...')
    zf.extractall()
    print('Done!')
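If you only need a few members rather than everything, here is a small sketch of pulling a single member into memory instead (the archive name reuses the example above):

from zipfile import ZipFile

with ZipFile("my_python_files.zip") as zf:
    # read one member as bytes without extracting the whole archive to disk
    data = zf.read(zf.namelist()[0])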
As the title says, I'm downloading a bz2 file which has a folder inside with a lot of text files...
My first version decompressed in memory, but although the download is only 90 MB, once uncompressed it holds 60 files of 750 MB each... the computer goes bang! It obviously can't handle something like 40 GB of RAM.
So the problem is that the files are too big to keep all in memory at the same time. I'm using this code, which works but is too slow:
import os
import tarfile

import requests

response = requests.get('https://fooweb.com/barfile.bz2')

# save file to disk:
compress_filepath = '{0}/files/sources/{1}'.format(zsets.BASE_DIR, check_time)
with open(compress_filepath, 'wb') as local_file:
    local_file.write(response.content)

# extract the files into a folder:
extract_folder = compress_filepath + '_ext'
with tarfile.open(compress_filepath, "r:bz2") as tar:
    tar.extractall(extract_folder)

# process one file at a time:
for filename in os.listdir(extract_folder):
    filepath = '{0}/{1}'.format(extract_folder, filename)
    with open(filepath, 'r') as file:
        for line in file:
            some_processing(line)
Is there a way I could do this without dumping it to disk... only decompressing and reading one file from the .bz2 at a time?
Thank you very much in advance for your time; I hope somebody knows how to help me with this...
#!/usr/bin/python3
import sys
import tarfile

import requests

# stream the archive instead of downloading it to disk first
got = requests.get(sys.argv[1], stream=True)
with tarfile.open(fileobj=got.raw, mode='r|*') as tar:
    for info in tar:
        if info.isreg():  # regular file (skips directories, links, etc.)
            ent = tar.extractfile(info)
            # now process ent as a file, however you like
            print(info.name, len(ent.read()))
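The important detail here is the pipe in mode='r|*': it tells tarfile to treat the input as a non-seekable stream with transparent compression detection, so it never tries to seek on the HTTP response and only one member is processed at a time.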
I did it this way:
import io
import tarfile

import requests

response = requests.get(my_url_to_file)
memfile = io.BytesIO(response.content)

# We extract files in memory, one by one (only the compressed
# archive itself is held in RAM):
filecount = 0
with tarfile.open(fileobj=memfile, mode="r:bz2") as tar:
    for member_name in tar.getnames():
        filecount += 1
        member = tar.extractfile(member_name)
        if member is None:  # skip directories
            continue
        for line in member:
            process_line(line)
I have a zip file that I receive when the user uploads a file. The zip essentially contains a json file which I want to read and process, without having to save the zip file to disk first, then unzip it, and then read the content of the inner file.
Currently I only know the longer process, which is something like below:
import json
import zipfile

from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt

@csrf_exempt
def get_zip(request):
    if request.method == "POST":
        try:
            client_file = request.FILES['file']
            file_path = "/some/path/"
            # first dump the zip file to a directory
            with open(file_path + '%s' % client_file.name, 'wb+') as dest:
                for chunk in client_file.chunks():
                    dest.write(chunk)
            # unzip the zip file to the same directory
            with zipfile.ZipFile(file_path + client_file.name, 'r') as zip_ref:
                zip_ref.extractall(file_path)
            # at this point we get a json file from the zip, say `test.json`
            # read the json file content
            with open(file_path + "test.json", "r") as fo:
                json_content = json.load(fo)
            doSomething(json_content)
            return HttpResponse(0)
        except Exception as e:
            return HttpResponse(1)
As you can see, this involves three steps to finally get the content from the zip file into memory. What I want is to get the content of the zip file and load it directly into memory.
I did find some similar questions on Stack Overflow, like this one: https://stackoverflow.com/a/2463819. But I am not sure at what point to invoke the operation mentioned in that post.
How can I achieve this?
Note: I am using Django on the backend.
There will always be one json file in the zip.
From what I understand, what @jason is trying to say here is to first open a zip file, just like you have done with zipfile.ZipFile(file_path + client_file.name, 'r') as zip_ref:.
class zipfile.ZipFile(file[, mode[, compression[, allowZip64]]])
Open a ZIP file, where file can be either a path to a file (a string) or a file-like object.
And then use BytesIO to read in the bytes of a file-like object. But above you are reading in r mode and not rb mode. So change it as follows:
import io
import zipfile

with open(filename, 'rb') as file_data:
    bytes_content = file_data.read()

file_like_object = io.BytesIO(bytes_content)
zipfile_ob = zipfile.ZipFile(file_like_object)
Now zipfile_ob can be accessed from memory.
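From there the archive can be used entirely from memory; for example, to list the members and read one of them (the member names are whatever your archive contains):

# list the members and read the first one, straight from memory
print(zipfile_ob.namelist())
data = zipfile_ob.read(zipfile_ob.namelist()[0])  # bytes of that member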
The first argument to zipfile.ZipFile() can be a file object rather than a pathname. I think the Django UploadedFile object supports this use, so you can read directly from that rather than having to copy into a file.
You can also open the file directly from the zip archive rather than extracting that into a file.
import json
import zipfile

from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt

@csrf_exempt
def get_zip(request):
    if request.method == "POST":
        try:
            client_file = request.FILES['file']
            # open the uploaded file object as a zip archive directly
            with zipfile.ZipFile(client_file, 'r') as zip_ref:
                first = zip_ref.infolist()[0]
                with zip_ref.open(first, "r") as fo:
                    json_content = json.load(fo)
            doSomething(json_content)
            return HttpResponse(0)
        except Exception as e:
            return HttpResponse(1)
I want to read the contents of a zip file into memory rather than extracting them to disk, find a particular file in the archive, open the file and extract a line from it.
Can a StringIO instance be opened and parsed? Suggestions? Thanks in advance.
import fnmatch
import StringIO
from zipfile import ZipFile

zfile = ZipFile('name.zip', 'r')
for name in zfile.namelist():
    if fnmatch.fnmatch(name, '*_readme.xml'):
        name = StringIO.StringIO()
        print name          # prints StringIO instances
        open(name, 'r')     # IO Error: No such file or directory...
I found a few similar posts, but none that seem to address this issue: Extracting a zipfile to memory?
IMO just using read is enough:
zfile = ZipFile('name.zip', 'r')
files = []
for name in zfile.namelist():
    if fnmatch.fnmatch(name, '*_readme.xml'):
        files.append(zfile.read(name))
This will make a list with contents of files that match the pattern.
You can then parse the contents afterwards by iterating through the list:
for file in files:
    print(file[0:min(35, len(file))].decode())  # "parsing"
Or better, use a functor:
import fnmatch
import sys
import zipfile

zip_name = sys.argv[1]
zfile = zipfile.ZipFile(zip_name, 'r')

def parse(contents, member_name=""):
    if len(member_name) > 0:
        print("Parsed `{}`:".format(member_name))
    print(contents[0:min(35, len(contents))].decode())  # "parsing"

for name in zfile.namelist():
    if fnmatch.fnmatch(name, '*.cpp'):
        parse(zfile.read(name), name)
This way no data is kept in memory for no reason, and the memory footprint is smaller. That might be important if the files are big.
Don't overthink it. It Just Works:
import zipfile

# 1) I want to read the contents of a zip file ...
with zipfile.ZipFile('A-Zip-File.zip') as zipper:

    # 2) ... find a particular file in the archive, open the file ...
    with zipper.open('A-Particular-File.txt') as fp:

        # 3) ... and extract a line from it.
        first_line = fp.readline()
        print(first_line)
The question you link to shows that you need to read the file. Depending on your use case, that may already be enough. In your code you replace the loop variable holding a filename with an empty string buffer. Try something like this:
zfile = ZipFile('name.zip', 'r')
for name in zfile.namelist():
    if fnmatch.fnmatch(name, '*_readme.xml'):
        ex_file = zfile.open(name)   # this is a file-like object
        content = ex_file.read()     # now the file contents are a single string
If you really want a buffer that you can manipulate, then simply instantiate it with the contents:
buf = StringIO(zfile.open(name).read())
You may also want to look at BytesIO and note that there are differences between Python 2 and 3.
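For instance, on Python 3 ZipFile.open() yields bytes, so BytesIO is the matching buffer type. A small sketch, assuming the member holds UTF-8 text:

import io

# a seekable in-memory buffer over the member's bytes
buf = io.BytesIO(zfile.open(name).read())
text = buf.getvalue().decode('utf-8')  # decode only when you need str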
Thank you to everyone who contributed solutions. This is what ended up working for me:
import fnmatch
import re
from zipfile import ZipFile

zfile = ZipFile('name.zip', 'r')
for name in zfile.namelist():
    if fnmatch.fnmatch(name, '*_readme.xml'):
        zopen = zfile.open(name)
        for line in zopen:
            # on Python 3 these lines are bytes, hence the decode
            if re.match('(.*)<foo>(.*)</foo>(.*)', line.decode()):
                print(line)
Ok, so I have a zip file that contains gz files (unix gzip).
Here's what I do --
def parseSTS(file):
    import zipfile, re, io, gzip
    with zipfile.ZipFile(file, 'r') as zfile:
        for name in zfile.namelist():
            if re.search(r'\.gz$', name) != None:
                zfiledata = zfile.open(name)
                print("start for file ", name)
                with gzip.open(zfiledata, 'r') as gzfile:
                    print("done opening")
                    filecontent = gzfile.read()
                    print("done reading")
                    print(filecontent)
This gives the following result --
>>>
start for file XXXXXX.gz
done opening
done reading
Then it stays like that forever until it crashes...
What can I do with filecontent?
Edit: this is not a duplicate, since my gzipped files are inside a zipped file and I'm trying to avoid extracting that zip file to disk. It works with zip files within a zip file, as per How to read from a zip file within zip file in Python?.
I created a zip file containing a gzip'ed PDF file I grabbed from the web.
I ran this code (with two small changes):
1) Fixed indenting of everything under the def statement (which I also corrected in your Question because I'm sure that it's right on your end or it wouldn't get to the problem you have).
2) I changed:
zfiledata = zfile.open(name)
print("start for file ", name)
with gzip.open(zfiledata, 'r') as gzfile:
    print("done opening")
    filecontent = gzfile.read()
    print("done reading")
    print(filecontent)
to:
print("start for file ", name)
with gzip.open(name,'rb') as gzfile:
print("done opening")
filecontent = gzfile.read()
print("done reading")
print(filecontent)
Because you were passing a file object to gzip.open instead of a string. I have no idea how your code is executing without that change, but it was crashing for me until I fixed it.
EDIT: Adding link to GZIP docs from James R's answer --
Also, see here for further documentation:
http://docs.python.org/2/library/gzip.html#examples-of-usage
END EDIT
Now, since my gzip'ed file is small, the behavior I observe is that it pauses for about 3 seconds after printing done reading, then outputs what is in filecontent.
I would suggest adding the following debugging line after your print("done reading") -- print(len(filecontent)). If this number is very, very large, consider not printing the entire file contents in one shot.
I would also suggest reading this for more insight into what I expect is your problem: Why is printing to stdout so slow? Can it be sped up?
EDIT 2 - an alternative if your system does not handle file I/O on zip files, causing "no such file" errors in the above:

def parseSTS(afile):
    import zipfile
    import gzip
    import io
    with zipfile.ZipFile(afile, 'r') as archive:
        for name in archive.namelist():
            if name.endswith('.gz'):
                bfn = archive.read(name)      # bytes of the gz member
                bfi = io.BytesIO(bfn)         # wrap them in a file object
                g = gzip.GzipFile(fileobj=bfi, mode='rb')
                qqq = g.read()
                print(qqq)

parseSTS('t.zip')
Most likely your problem lies here:
if name.endswith(".gz"):  # as goncalopp said in the comments, use endswith
    # zfiledata = zfile.open(name)  # don't do this
    # print("start for file ", name)
    with gzip.open(name, 'rb') as gzfile:  # gz files should be read in binary, and gzip opens the path directly
        # print("done opening")  # trust in your program, luke
        filecontent = gzfile.read()
        # print("done reading")
        print(filecontent)
See here for further documentation:
http://docs.python.org/2/library/gzip.html#examples-of-usage
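On Python 3 you can also skip the intermediate BytesIO, since gzip.GzipFile accepts the zip member's file object directly; a sketch, reusing the 't.zip' name from above:

import gzip
import zipfile

# stream each gzip member straight out of the zip, entirely in memory
with zipfile.ZipFile('t.zip') as zfile:
    for name in zfile.namelist():
        if name.endswith('.gz'):
            with zfile.open(name) as member:
                with gzip.GzipFile(fileobj=member) as gz:
                    print(gz.read())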
test.txt contains the list of files to be downloaded:
http://example.com/example/afaf1.tif
http://example.com/example/afaf2.tif
http://example.com/example/afaf3.tif
http://example.com/example/afaf4.tif
http://example.com/example/afaf5.tif
How can these files be downloaded using Python with maximum download speed?
My thinking was as follows:
import urllib.request

with open('test.txt', 'r') as f:
    lines = f.read().splitlines()

for line in lines:
    response = urllib.request.urlopen(line)
What comes after that? How do I select the download directory?
Select a path to your desired output directory (output_dir). In your for loop, split every URL on the / character and use the last piece as the filename. Also open the files for writing in binary mode wb, since response.read() returns bytes, not str.
import os
import urllib.request

output_dir = 'path/to/your/output/dir'

with open('test.txt', 'r') as f:
    lines = f.read().splitlines()

for line in lines:
    response = urllib.request.urlopen(line)
    output_file = os.path.join(output_dir, line.split('/')[-1])
    with open(output_file, 'wb') as writer:
        writer.write(response.read())
Note:
Downloading multiple files can be faster if you use multiple threads, since the download rarely uses the full bandwidth of your internet connection (see the sketch after these notes).
Also, if the files you are downloading are pretty big, you should probably stream the read (reading chunk by chunk). As @Tiran commented, you should use shutil.copyfileobj(response, writer) instead of writer.write(response.read()).
I would only add that you should probably always specify the length parameter too: shutil.copyfileobj(response, writer, 5*1024*1024)  # (at least 5 MB), since the default value of 16 KB is really small and it will just slow things down.
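A minimal sketch of the threaded, streaming variant, assuming the same test.txt layout as above (the worker count is an arbitrary choice):

import os
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor

output_dir = 'path/to/your/output/dir'

def fetch(url):
    # stream each download to disk in 5 MB chunks
    output_file = os.path.join(output_dir, url.split('/')[-1])
    with urllib.request.urlopen(url) as response:
        with open(output_file, 'wb') as writer:
            shutil.copyfileobj(response, writer, 5 * 1024 * 1024)

with open('test.txt', 'r') as f:
    urls = f.read().splitlines()

with ThreadPoolExecutor(max_workers=4) as pool:
    # list() forces completion and surfaces any download errors
    list(pool.map(fetch, urls))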
This works fine for me (note that fileName must be just the file's name relative to baseUrl, for example 'afaf1.tif'):
import os
import urllib.request

def download(baseUrl, fileName, layer=0):
    print('Trying to download file:', fileName)
    url = baseUrl + fileName
    name = os.path.join('foldertodownload', fileName)
    try:
        # Note that the folder needs to exist
        urllib.request.urlretrieve(url, name)
    except Exception:
        # Upon failure to download, retries 5 times in total
        print('Download failed')
        print('Could not download file:', fileName)
        if layer > 4:
            return
        else:
            layer += 1
            print('retrying', str(layer) + '/5')
            download(baseUrl, fileName, layer)
    print(fileName + ' downloaded')

for fileName in nameList:
    download(url, fileName)
Moved unnecessary code out of the try block.