Extract zip to memory, parse contents - python

I want to read the contents of a zip file into memory rather than extracting them to disc, find a particular file in the archive, open the file and extract a line from it.
Can a StringIO instance be opened and parsed? Suggestions? Thanks in advance.
zfile = ZipFile('name.zip', 'r')
for name in zfile.namelist():
    if fnmatch.fnmatch(name, '*_readme.xml'):
        name = StringIO.StringIO()
        print name # prints StringIO instances
        open(name, 'r') # IO Error: No such file or directory...
I found a few similar posts, but none that seem to address this issue: Extracting a zipfile to memory?

IMO just using read is enough:
zfile = ZipFile('name.zip', 'r')
files = []
for name in zfile.namelist():
    if fnmatch.fnmatch(name, '*_readme.xml'):
        files.append(zfile.read(name))
This builds a list with the contents of every file that matches the pattern.
You can then parse the contents afterwards by iterating through the list:
for file in files:
    print(file[0:min(35,len(file))].decode()) # "parsing"
Or better, use a function:
import zipfile
import sys
import fnmatch
zip_name = sys.argv[1]
zfile = zipfile.ZipFile(zip_name, 'r')
def parse(contents, member_name=""):
    if len(member_name) > 0:
        print("Parsed `{}`:".format(member_name))
    print(contents[0:min(35, len(contents))].decode()) # "parsing"
for name in zfile.namelist():
    if fnmatch.fnmatch(name, '*.cpp'):
        parse(zfile.read(name), name)
This way no data is kept in memory longer than needed, so the memory footprint is smaller. That may matter if the files are big.

Don't overthink it. It Just Works:
import zipfile
# 1) I want to read the contents of a zip file ...
with zipfile.ZipFile('A-Zip-File.zip') as zipper:
    # 2) ... find a particular file in the archive, open the file ...
    with zipper.open('A-Particular-File.txt') as fp:
        # 3) ... and extract a line from it.
        first_line = fp.readline()
print(first_line)

The question you link shows you that you need to read the file. Depending on your use case that may already be enough. In your code you replace the loop variable holding a filename with an empty string buffer. Try something like this:
zfile = ZipFile('name.zip', 'r')
for name in zfile.namelist():
    if fnmatch.fnmatch(name, '*_readme.xml'):
        ex_file = zfile.open(name) # this is a file like object
        content = ex_file.read() # now file-contents are a single string
If you really want a buffer that you can manipulate, then simply instantiate it with the contents:
buf = StringIO(zfile.open(name).read())
You may also want to look at BytesIO and note that there are differences between Python 2 and 3.
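For Python 3, where zipfile hands back bytes, the same idea can be sketched as follows. This is a minimal sketch rather than the only way to do it: the helper name `first_line_of_match` and the demo archive built in memory are assumptions made for illustration.

```python
import fnmatch
import io
import zipfile


def first_line_of_match(zip_source, pattern, encoding="utf-8"):
    """Return the first line of the first member matching pattern, or None.

    zip_source may be a path or a seekable binary file-like object.
    Nothing is extracted to disk.
    """
    with zipfile.ZipFile(zip_source) as zf:
        for name in zf.namelist():
            if fnmatch.fnmatch(name, pattern):
                # zf.open() yields a binary stream; TextIOWrapper decodes it.
                with zf.open(name) as raw:
                    text = io.TextIOWrapper(raw, encoding=encoding)
                    return text.readline().rstrip("\n")
    return None


# Hypothetical demo archive, built entirely in memory.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("docs/a_readme.xml", "<foo>hello</foo>\nsecond line\n")
    zf.writestr("other.txt", "ignored\n")

print(first_line_of_match(demo, "*_readme.xml"))
```

The member is opened as a stream, so only the part that is actually read gets decompressed.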

Thank you to everyone that contributed solutions. This is what ended up working for me:
zfile = ZipFile('name.zip', 'r')
for name in zfile.namelist():
    if fnmatch.fnmatch(name, '*_readme.xml'):
        zopen = zfile.open(name)
        for line in zopen:
            if re.match('(.*)<foo>(.*)</foo>(.*)', line):
                print line

Related

How to speed up zip file extraction in Python

I'm trying to extract data from a zip file in Python, but it's kind of slow. Could anyone advise me and see if I'm doing something that obviously makes it slower?
def go_through_zip(zipname):
    out = {}
    with ZipFile(zipname) as z:
        for filename in z.namelist():
            with z.open(filename) as f:
                try:
                    outdict = make_dict(f)
                    out.update(outdict)
                except:
                    print("File is not in the correct format")
    return out
make_dict(f) just takes the file path and makes a dictionary, and this function is probably also slow, but that's not what I want to speed up right now.
Try using the following code for file extraction. It works fast as long as the size of the file being extracted is reasonable.
# importing required modules
from zipfile import ZipFile
# specifying the zip file name
file_name = "my_python_files.zip"
# opening the zip file in READ mode
with ZipFile(file_name, 'r') as zip:
    # printing all the contents of the zip file
    zip.printdir()
    # extracting all the files
    print('Extracting all the files now...')
    zip.extractall()
    print('Done!')

Sending multiple .CSV files to .ZIP without storing to disk in Python

I'm working on a reporting application for my Django powered website. I want to run several reports and have each report generate a .csv file in memory that can be downloaded in batch as a .zip. I would like to do this without storing any files to disk. So far, to generate a single .csv file, I am following the common operation:
mem_file = StringIO.StringIO()
writer = csv.writer(mem_file)
writer.writerow(["My content", my_value])
mem_file.seek(0)
response = HttpResponse(mem_file, content_type='text/csv')
response['Content-Disposition'] = 'attachment; filename=my_file.csv'
This works fine, but only for a single, unzipped .csv. If I had, for example, a list of .csv files created with a StringIO stream:
firstFile = StringIO.StringIO()
# write some data to the file
secondFile = StringIO.StringIO()
# write some data to the file
thirdFile = StringIO.StringIO()
# write some data to the file
myFiles = [firstFile, secondFile, thirdFile]
How could I return a compressed file that contains all objects in myFiles and can be properly unzipped to reveal three .csv files?
zipfile is a standard library module that does exactly what you're looking for. For your use-case, the meat and potatoes is a method called "writestr" that takes a name of a file and the data contained within it that you'd like to zip.
In the code below, I've used a sequential naming scheme for the files when they're unzipped, but this can be switched to whatever you'd like.
import zipfile
import StringIO
zipped_file = StringIO.StringIO()
with zipfile.ZipFile(zipped_file, 'w') as zip:
    for i, file in enumerate(myFiles): # myFiles is the list from the question
        file.seek(0)
        zip.writestr("{}.csv".format(i), file.read())
zipped_file.seek(0)
If you want to future-proof your code (hint hint Python 3 hint hint), you might want to switch over to using io.BytesIO instead of StringIO for the archive, since zip data is bytes in Python 3. The version below also calls getvalue() on each file, which returns the whole buffer regardless of the current position, so the explicit seek before each read is not needed (I haven't tested this behavior with Django's HttpResponse, so I've left that final seek in there just in case).
import io
import zipfile
zipped_file = io.BytesIO()
with zipfile.ZipFile(zipped_file, 'w') as f:
    for i, file in enumerate(myFiles): # myFiles is the list from the question
        f.writestr("{}.csv".format(i), file.getvalue())
zipped_file.seek(0)
The stdlib comes with the module zipfile, and the main class, ZipFile, accepts a file or file-like object:
import StringIO
from zipfile import ZipFile
temp_file = StringIO.StringIO()
zipped = ZipFile(temp_file, 'w')
# create temp csv_files = [(name1, data1), (name2, data2), ... ]
for name, data in csv_files:
    data.seek(0)
    zipped.writestr(name, data.read())
zipped.close()
temp_file.seek(0)
# etc. etc.
I'm not a user of StringIO so I may have the seek and read out of place, but hopefully you get the idea.
def zip_files(files):
    outfile = StringIO() # io.BytesIO() for python 3
    with zipfile.ZipFile(outfile, 'w') as zf:
        for n, f in enumerate(files):
            zf.writestr("{}.csv".format(n), f.getvalue())
    return outfile.getvalue()
zipped_file = zip_files(myFiles)
response = HttpResponse(zipped_file, content_type='application/octet-stream')
response['Content-Disposition'] = 'attachment; filename=my_file.zip'
StringIO has a getvalue method which returns the entire contents. You can compress the archive by passing zipfile.ZIP_DEFLATED, as in zipfile.ZipFile(outfile, 'w', zipfile.ZIP_DEFLATED). The default compression is ZIP_STORED, which creates the zip file without compressing.
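To make the ZIP_STORED versus ZIP_DEFLATED point concrete, here is a minimal Python 3 sketch. The function name `zip_csv_buffers` and the generated report data are assumptions made for illustration; only the standard library `csv`, `io` and `zipfile` modules are used.

```python
import csv
import io
import zipfile


def zip_csv_buffers(named_buffers, compression=zipfile.ZIP_DEFLATED):
    """Pack (name, io.StringIO) pairs into a zip archive and return its bytes."""
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", compression) as zf:
        for name, buf in named_buffers:
            # getvalue() ignores the current position, so no seek is needed.
            zf.writestr(name, buf.getvalue())
    return archive.getvalue()


# Hypothetical report data: three small CSV files held in memory.
reports = []
for i in range(3):
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["report", "value"])
    for row in range(200):
        writer.writerow([i, row])
    reports.append(("report_{}.csv".format(i), buf))

deflated = zip_csv_buffers(reports)
stored = zip_csv_buffers(reports, zipfile.ZIP_STORED)
print(len(deflated), len(stored))
```

The returned bytes can be handed to HttpResponse in the same way as in the answer above.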

Read a list from a file and append to it using Python

I have a file called usernames.py that may contain a list or may not exist at all:
usernames.py
['user1', 'user2', 'user3']
In Python I now want to read this file if it exists and append a new user to the list, or create a list with that user, i.e. ['user3'].
This is what I have tried:
with open(path + 'usernames.py', 'w+') as file:
    file_string = host_file.read()
    file_string.append(instance)
    file.write(file_string)
This gives me the error unresolved 'append'. How can I achieve this? Python does not know it is a list, and if the file does not exist it is even worse, as I have nothing to convert to a list.
Try this:
import os
filename = 'data'
if os.path.isfile(filename):
    with open(filename, 'r') as f:
        l = eval(f.readline())
else:
    l = []
l.append(instance)
with open(filename, 'w') as f:
    f.write(str(l))
BUT this is quite unsafe if you don't know where the file is from as it could include any code to do anything!
It would be better not to use a python file for persistence -- what happens if someone slips you a usernames.py that has exploit code in it? Consider a csv file or a pickle, or just a text file with one user per line.
That said, if you don't open it as a python file, something like this should work:
from os.path import join
with open( join(path, 'usernames.py'), 'r+') as file:
    file_string = file.read()
    file_string = file_string.strip().strip('[').strip(']')
    file_data = [ name.strip().strip('"').strip("'") for name in file_string.split(',' )]
    file_data.append( instance )
    file.seek(0) # rewind before overwriting; file objects have seek(), not fseek()
    file.write(str(file_data))
If usernames contain commas or end in quotes, you have to be more careful.
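If the file really must hold a Python list literal, `ast.literal_eval` from the standard library is a safer way to read it than `eval`, because it only accepts literals and does not execute arbitrary code. A minimal Python 3 sketch, where the helper name `append_username` and the temporary-directory demo are assumptions made for illustration:

```python
import ast
import os
import tempfile


def append_username(filename, username):
    """Append username to the list literal stored in filename, creating it if needed."""
    if os.path.isfile(filename):
        with open(filename, "r") as f:
            # literal_eval parses literals only; an empty file counts as [].
            users = ast.literal_eval(f.read().strip() or "[]")
        if not isinstance(users, list):
            raise ValueError("expected a list literal in {}".format(filename))
    else:
        users = []
    users.append(username)
    with open(filename, "w") as f:
        f.write(repr(users))
    return users


# Demo in a temporary directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "usernames.py")
    first = append_username(path, "user1")
    second = append_username(path, "user2")
    print(first, second)
```

Usernames containing commas or quotes survive the round trip, since the list is written with `repr` and parsed back as a literal.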

Error when trying to read and write multiple files

I modified the code based on the comments from experts in this thread. Now the script reads and writes all the individual files. It iterates over the matches, highlights them and writes the output. The current issue is that, after highlighting the last instance of a search item, the script drops all the content that follows that last instance from the output of each file.
Here is the modified code:
import os
import sys
import re
source = raw_input("Enter the source files path:")
listfiles = os.listdir(source)
for f in listfiles:
    filepath = source+'\\'+f
    infile = open(filepath, 'r+')
    source_content = infile.read()
    color = ('red')
    regex = re.compile(r"(\b be \b)|(\b by \b)|(\b user \b)|(\bmay\b)|(\bmight\b)|(\bwill\b)|(\b's\b)|(\bdon't\b)|(\bdoesn't\b)|(\bwon't\b)|(\bsupport\b)|(\bcan't\b)|(\bkill\b)|(\betc\b)|(\b NA \b)|(\bfollow\b)|(\bhang\b)|(\bbelow\b)", re.I)
    i = 0; output = ""
    for m in regex.finditer(source_content):
        output += "".join([source_content[i:m.start()],
                           "<strong><span style='color:%s'>" % color[0:],
                           source_content[m.start():m.end()],
                           "</span></strong>"])
        i = m.end()
    outfile = open(filepath, 'w+')
    outfile.seek(0)
    outfile.write(output)
    print "\nProcess Completed!\n"
    infile.close()
    outfile.close()
raw_input()
The error message tells you what the error is:
No such file or directory: 'sample1.html'
Make sure the file exists. Or do a try statement to give it a default behavior.
The reason you get that error is that the Python script doesn't know where the files you want to open are located.
You have to provide the full file path to open a file, as I have done below. I have simply concatenated the source path, '\\' and the filename, and saved the result in a variable named filepath. Now use this variable when calling open().
import os
import sys
source = raw_input("Enter the source files path:")
listfiles = os.listdir(source)
for f in listfiles:
    filepath = source+'\\'+f # This is the file path
    infile = open(filepath, 'r')
There are also a couple of other problems with your code. If you want to open the file for both reading and writing, you have to use r+ mode. Moreover, on Windows, if you open a file in r+ mode you may have to call file.seek() before file.write() to avoid another issue. You can read the reason for using the file.seek() here.
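The truncation described in the modified question comes from the loop never appending the text that follows the last match. A minimal Python 3 sketch of one fix, assuming a shortened, hypothetical word list and a helper named `highlight`: append `source_content[i:]` once the loop has finished.

```python
import re

# Shortened, hypothetical word list; the question's pattern is much longer.
regex = re.compile(r"\b(may|might|will|support)\b", re.I)


def highlight(source_content, color="red"):
    """Wrap every match in a coloured <strong><span>, keeping all other text."""
    i = 0
    parts = []
    for m in regex.finditer(source_content):
        parts.append(source_content[i:m.start()])
        parts.append("<strong><span style='color:%s'>" % color)
        parts.append(m.group(0))
        parts.append("</span></strong>")
        i = m.end()
    # The step missing from the question: keep the text after the last match.
    parts.append(source_content[i:])
    return "".join(parts)


result = highlight("It may work, and the rest stays.")
print(result)
```

Collecting the pieces in a list and joining once also avoids rebuilding the output string on every match.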

Reading the same file multiple times in Python

I need to download a zip archive of text files, dispatch each text file in the archive to other handlers for processing, and finally write the unzipped text file to disk.
I have the following code. It uses multiple open/close on the same file, which does not seem elegant. How do I make it more elegant and efficient?
zipped = urllib.urlopen('www.abc.com/xyz.zip')
buf = cStringIO.StringIO(zipped.read())
zipped.close()
unzipped = zipfile.ZipFile(buf, 'r')
for f_info in unzipped.infolist():
    logfile = unzipped.open(f_info)
    handler1(logfile)
    logfile.close() ## Cannot seek(0). The file like obj does not support seek()
    logfile = unzipped.open(f_info)
    handler2(logfile)
    logfile.close()
    unzipped.extract(f_info)
Your answer is in your example code. Just use StringIO to buffer the logfile:
zipped = urllib.urlopen('www.abc.com/xyz.zip')
buf = cStringIO.StringIO(zipped.read())
zipped.close()
unzipped = zipfile.ZipFile(buf, 'r')
for f_info in unzipped.infolist():
    logfile = unzipped.open(f_info)
    # Here's where we buffer:
    logbuffer = cStringIO.StringIO(logfile.read())
    logfile.close()
    for handler in [handler1, handler2]:
        handler(logbuffer)
        # StringIO objects support seek():
        logbuffer.seek(0)
    unzipped.extract(f_info)
You could say something like:
handler_dispatch(logfile)
and
def handler_dispatch(file):
    for line in file:
        handler1(line)
        handler2(line)
or even make it more dynamic by constructing a Handler class that holds multiple handlerN objects and applies each of them inside handler_dispatch. For example:
class Handler:
    def __init__(self):
        self.handlers = []
    def add_handler(self, handler):
        self.handlers.append(handler)
    def handler_dispatch(self, file):
        for line in file:
            for handler in self.handlers:
                handler.handle(line)
Open the zip file once, loop through all the members, read each one and process it, then write it to disk.
Like so:
for f_info in unzipped.infolist():
    file = unzipped.open(f_info)
    data = file.read()
    file.close()
    # If you need a file like object, wrap it in a cStringIO
    fobj = cStringIO.StringIO(data)
    handler1(fobj)
    fobj.seek(0) # rewind so the second handler reads from the start
    handler2(fobj)
    with open(f_info.filename, "wb") as fp:
        fp.write(data)
You get the idea.
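A Python 3 sketch of the same read-once, dispatch-many pattern, under stated assumptions: `dispatch_members`, `count_lines` and `first_line` are hypothetical names standing in for the real handlers, and the archive is built in memory instead of being downloaded.

```python
import io
import zipfile


def dispatch_members(zip_bytes, handlers):
    """Read each member once, then give every handler its own buffer."""
    results = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as unzipped:
        for f_info in unzipped.infolist():
            data = unzipped.read(f_info)  # one decompression per member
            for handler in handlers:
                # A fresh BytesIO starts at position 0, so no seek is needed.
                results.append((f_info.filename, handler(io.BytesIO(data))))
    return results


# Hypothetical handlers standing in for handler1 and handler2.
def count_lines(fobj):
    return sum(1 for _ in fobj)


def first_line(fobj):
    return fobj.readline().rstrip(b"\n")


# Demo archive built in memory instead of downloaded.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("log1.txt", "alpha\nbeta\n")

results = dispatch_members(archive.getvalue(), [count_lines, first_line])
print(results)
```

Writing `data` to disk, as in the answers above, would be one more step inside the member loop.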
