I have some code which I am using to open a large zip which contains some csv files and then parse them.
I am using this code below but I am now wondering if I am actually unzipping the entire file into memory and then extracting the file contents to disk as well, after which I read the files in one by one.
def unzip_file(file_path):
zip_ref = zipfile.ZipFile(file_path, 'r')
extracted = zip_ref.namelist()
zip_ref.extractall('/tmp/extracts')
zip_ref.close()
return extracted
Is this actually unzipping the files and their contents into memory and then extracting the files straight to disk? I use the extracted variable afterwards as it contains a list of the file names I need to process but I dont also want to open each file into memory and then read them again.
Your concern is that you are wasting memory or being inefficient in the manner you are reading the files when extracting them. The answer to if you're doing anything "wrong" is simply: "No". Your code is correct and it does not keep files in memory after you have finished the function call.
A few notes on what you can improve though.
Use Context Managers to Automatically Close File
The ZipFile is also a context manager and it is generally considered best practice to use it to make sure that files are closed and cleaned up from memory correctly. Instead of calling .close() manually you could do the following:
with ZipFile(file_path, "r") as zip_ref:
zip_ref.extractall("/tmp/extracts")
It will then automatically close the file after the context manager is done, and make sure that nothing is stored in memory.
Since you close the file, you do not have to fear that it will stay in memory.
Read Files without Extracting
Since you are extracting the files to a /tmp/ folder, I guess(?) that you actually don't want to store the files on disk. Perhaps all you want to do is to read the data and do something with it.
You can read each file within the zip file without extracting them to disk.
with ZipFile('spam.zip') as myzip:
with myzip.open('eggs.txt') as myfile:
print(myfile.read())
This might be a better solution depending on what you want to achieve. You can see more from the python docs.
Related
What I need to achieve is to concatenate a list of files into a single file, using the cloudstorage library. This needs to happen inside a mapreduce shard, which has a 512MB upper limit on memory, but the concatenated file could be larger than 512MB.
The following code segment breaks when file size hit the memory limit.
list_of_files = [...]
with cloudstorage.open(filename...) as file_handler:
for a in list_of_files:
with cloudstorage.open(a) as f:
file_handler.write(f.read())
Is there a way to walk around this issue? Maybe open or append files in chunk? And how to do that? Thanks!
== EDIT ==
After some more testing, it seems that memory limit only applies to f.read(), while writing to a large file is okay. Reading files in chunks solved my issue, but I really like the compose() function as #Ian-Lewis pointed out. Thanks!
For large file you will want to break the file up into smaller files, upload each of those and then merge them together as composite objects. You will want to use the compose() function from the library. It seems there is no docs on it yet.
After you've uploaded all the parts something like the following should work. One thing to make sure of is that the paths files to be composed don't contain the bucket name or a slash at the beginning.
stat = cloudstorage.compose(
[
"path/to/part1",
"path/to/part2",
"path/to/part3",
# ...
],
"/my_bucket/path/to/output"
)
You may also want to check out using the gsutil tool if possible. It can do automatic splitting, uploading in parallel, and compositing of large files for you.
I currently have the following csv writer class:
class csvwriter():
writer = None
writehandler = None
#classmethod
def open(cls,file):
cls.writehandler = open(file,'wb')
cls.writer = csv.writer(cls.writehandler, delimiter=',',quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
#classmethod
def write(cls,arr):
cls.writer.writerow(arr)
#classmethod
def close(cls):
cls.writehandler.close()
which can generate proper csv files without ever having to store the full array in memory at a single time.
However, the files created through use of this code can be quite large, so I'm looking to compress them, rather than writing them uncompressed. (In order to save on disk usage). I can't effectively store the file in memory either, as I'm expecting files of well over 20gb to be a regular occurence.
The recipients of the resulting files are generally not sysadmins of their PCs, nor do they all use linux, so I'm constrained in the types of algorithms I'm able to use for this task. Preferably, the solution would use a compression scheme that's natively readable (no executables required) in Windows, OSX and any linux distribution.
I've found gzip provides a very handy interface in Python, but reading gzipped files in windows seems like quite a hassle.. Ideally I'd put them in a zip archive, but zip archive don't allow you to append data to files already present in the archive, which then forces me to store the whole file in memory, or write the data away to several smaller files that I would be able to fit in memory.
My question: Is there a solution that would benefit from the best of both worlds? Widespread availability of tools to read the target format on the end-user's machine, and also the ability to append, rather than write the whole file in one go?
Thanks in advance for your consideration!
gzlog may provide the functionality you're looking for. It efficiently appends short strings to a gzip file, intended for applications where short messages are appended to a long log.
I've been trying to use the built-in python zipfiles module to manipulate some .zip files on windows, I wish to use them to store a number of files related to the current project in a program. The problem comes when I load the files from the zip and then wish to re-save them into a new, different zip file:
import zipfile
zp = zipfile.ZipFile(r"first.zip",mode='r')
myfile = zp.open(r"stored_file.txt",mode='r')
### Do something, then want to save again ###
zp2 = zipfile.ZipFile(r"second.zip",mode='w')
#Doesn't work, as myfile isn't a real file:
zp2.write(myfile)
#Doesn't work, as the path can't be resolved:
zp2.write(os.path.join(zp.filename,myfile.name))
#The following works... as long as you haven't called read()
#since .seek(0) doesn't work for ZipExtFile
zp2.writestr(myfile.name,myfile.read())
I could, of course, extract the files to somewhere and then re-add them to the new zip that way, but it would be clunky and require a lot of cleanup (and creating a lot of temporary files).
Equally I could keep track of the original zip file and use the writestr method by re-opening the file, but I was hoping to avoid it. I just wondered if there was a better way around this problem; it means I'll have to have code that determines whether the file originally came from a zip or not as well and handle it differently if it did.
Edit: If anyone else has the final problem with seek(0) not working on ZipExtFile, it is possible to use an io.StringIO class to hold the result of str(myfile.read()), which is then seekable. It means I have to keep the files loaded in memory, though, so I'm going to go with keeping track of the zipfile and transferring them only when I need them.
I have a simple problem that I hope will have a simple solution.
I am writing python(2.7) code using the xlwt package to write excel files. The program takes data and writes it out to a file that is being saved constantly. The problem is that whenever I have the file open to check the data and python tries to save the file the program crashes.
Is there any way to make python save the file when I have it open for reading?
My experience is that sashkello is correct, Excel locks the file. Even OpenOffice/LibreOffice do this. They lock the file on disk and create a temp version as a working copy. ANY program trying to access the open file will be denied by the OS. The reason for this is because many corporations treat Excel files as databases but the users have no understanding of the issues involved in concurrency and synchronisation.
I am on linux and I get this behaviour (at least when the file is on a SAMBA share). Look in the same directory as your file, if a file called .~lock.[filename]# exists then you will be unable to read your file from another program. I'm not sure what enforces this lock but I suspect it's an NTFS attribute. Note that even a simple cp or cat fails: cp: error reading ‘CATALOGUE.ods’: Input/output error
UPDATE: The actual locking mechanism appears to be 'oplocks`, a concept connected to Windows shares: http://oreilly.com/openbook/samba/book/ch05_05.html . If the share is managed by Samba the workaround is to disable locks on certain file types, eg:
veto oplock files = /*.xlsx/
If you aren't using a share or NTFS on linux then I guess you should be able to RW the file as long as your script has write permissions. By default only the user who created the file has write access.
WORKAROUND 2: The restriction only seems to apply if you have the file open in Excel/LO as writable, however LO at least allows you to open a file as read-only (Go to File -> Properties -> Security, set Read-Only, Save and re-open the file). I don't know if this will also make it RO for xlwt though.
Hah, funny I ran across your post. I actually just implemented this tonight.
The issue is that Excel files write, and that's it, not both. You cannot read/write off the same object. So if you have another method to save data please do. I'm in a position where I don't have an option.. and so might you.
You're going to need xlutils it's the bread and butter to this.
Here's some example code:
from xlutils.copy import copy
wb_filename = 'example.xls'
wb_object = xlrd.open_workbook(wb_filename)
# And then you can read this file to your hearts galore.
# Now when it comes to writing to this, you need to copy the object and work off that.
write_object = copy(wb_object)
# Write to it all you want and then save that object.
And that's it, now if you read the object, write to it, and read the original one again it won't be updated. You either need to recreate wb_object or you need to create some sort of table in memory that you can keep track of while working through it.
I just found out that I can save space\ speed up reads of CSV files.
Using the answer of my previous question
How do I create a CSV file from database in Python?
And 'wb' for opens
w = csv.writer(open(Fn,'wb'),dialect='excel')
How can I open all files in a directory and saves all files with the same name as starting name and use 'wb' to reformat all files. I guess convert all CSV's to binary CSV's.
You can't "overwrite a file on the fly". You have two options:
if the files are small enough (smaller than the amount of available RAM by
a comfortable margin), just loop over them (os.listdir makes that loop
easy, or os.walk if you want to catch the whole tree of subdirectories,
not just one directory), and for each, read it in memory first, then
overwrite the on-disk copy.
otherwise, loop over them, and each time write to a new file (e.g. by
appending .new to the name), then move the new file over the old. This
is safer (no risk of running out of memory, no risk of damaging a file if
the computer crashes) but more complicated.
So, what is your situation: small-enough files (and backups for safeguard against computer and disk crashes), in which case I can if you wish show the simple code; or huge multi-GB files -- in which case it will have to be the complex code? Let us know!