I just found out that I can save space and speed up reads of CSV files
by using the answer to my previous question,
"How do I create a CSV file from database in Python?",
and opening the output file with 'wb':
w = csv.writer(open(Fn,'wb'),dialect='excel')
How can I open every file in a directory, rewrite it using 'wb', and save it under its original name? I guess that amounts to converting all the CSVs to binary CSVs.
You can't "overwrite a file on the fly". You have two options:
if the files are small enough (smaller than the amount of available RAM by
a comfortable margin), just loop over them (os.listdir makes that loop
easy, or os.walk if you want to catch the whole tree of subdirectories,
not just one directory), and for each, read it in memory first, then
overwrite the on-disk copy.
otherwise, loop over them, and each time write to a new file (e.g. by
appending .new to the name), then move the new file over the old. This
is safer (no risk of running out of memory, no risk of damaging a file if
the computer crashes) but more complicated.
So, what is your situation: small-enough files (and backups for safeguard against computer and disk crashes), in which case I can if you wish show the simple code; or huge multi-GB files -- in which case it will have to be the complex code? Let us know!
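For reference, the second (safer) option could look roughly like the sketch below. It is only an illustration under some assumptions: Python 3 (where the csv module wants text mode with newline="" instead of 'wb'), a flat directory, and the function names and the ".new" suffix are made up for the example.

```python
import csv
import os


def rewrite_csv_in_place(path):
    """Rewrite one CSV via a temporary sibling file, then swap it in."""
    tmp_path = path + ".new"  # hypothetical temp-file naming
    # newline="" is what the Python 3 csv module expects; on Python 2 the
    # equivalent is opening the files with "rb" / "wb".
    with open(path, newline="") as src, open(tmp_path, "w", newline="") as dst:
        writer = csv.writer(dst, dialect="excel")
        for row in csv.reader(src):
            writer.writerow(row)
    # Replace the original only after the new file is completely written.
    os.replace(tmp_path, path)


def rewrite_all_csvs(directory):
    """Apply rewrite_csv_in_place to every .csv file directly in directory."""
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.lower().endswith(".csv") and os.path.isfile(path):
            rewrite_csv_in_place(path)
```

Because rows are streamed one at a time, memory use stays small regardless of file size. Swap os.listdir for os.walk if subdirectories should be covered too.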
On a Raspberry Pi 4, I am trying to read a series of files (more than 1000 files) in a specific directory with the fopen function in a for loop, but fopen cannot read the file if it exceeds a certain number of iterations. How do I solve this?
but fopen cannot read the file if it exceeds a certain number of iterations.
A wild guess: you neglect to fclose the files after you are done with them, leading to eventual exhaustion of either memory or available file descriptors.
How do I solve this?
Make sure to fclose your files.
When you open a file using fopen, the system uses a file descriptor to refer to that file, and only so many of them are available to a process. A quick Google search says the Microsoft C runtime allows 512 open streams by default; on Linux, which a Raspberry Pi typically runs, the limit is whatever ulimit -n reports (often 1024). Each open stream also holds a buffer in memory, so leaving many files open wastes memory as well. You should close each file after you are done with it. This usually is not a problem when working with a few files, but in a case like yours, where thousands of files are involved, each one should be closed as soon as possible after use.
I have some code which I am using to open a large zip which contains some csv files and then parse them.
I am using this code below but I am now wondering if I am actually unzipping the entire file into memory and then extracting the file contents to disk as well, after which I read the files in one by one.
import zipfile

def unzip_file(file_path):
    zip_ref = zipfile.ZipFile(file_path, 'r')
    extracted = zip_ref.namelist()  # names of the archive members
    zip_ref.extractall('/tmp/extracts')  # writes every member to disk
    zip_ref.close()
    return extracted
Is this actually unzipping the files and their contents into memory and then extracting the files straight to disk? I use the extracted variable afterwards as it contains a list of the file names I need to process, but I don't also want to open each file into memory and then read them again.
Your concern is that you are wasting memory or being inefficient in the way you read the files when extracting them. The answer to whether you're doing anything "wrong" is simply "No". Your code is correct, and it does not keep files in memory after the function call has finished.
A few notes on what you can improve though.
Use Context Managers to Automatically Close File
The ZipFile is also a context manager and it is generally considered best practice to use it to make sure that files are closed and cleaned up from memory correctly. Instead of calling .close() manually you could do the following:
from zipfile import ZipFile

with ZipFile(file_path, "r") as zip_ref:
    zip_ref.extractall("/tmp/extracts")
It will then automatically close the file after the context manager is done, and make sure that nothing is stored in memory.
Since you close the file, you do not have to fear that it will stay in memory.
Read Files without Extracting
Since you are extracting the files to a /tmp/ folder, I guess(?) that you actually don't want to store the files on disk. Perhaps all you want to do is to read the data and do something with it.
You can read each file within the zip file without extracting them to disk.
from zipfile import ZipFile

with ZipFile('spam.zip') as myzip:
    with myzip.open('eggs.txt') as myfile:
        print(myfile.read())
This might be a better solution depending on what you want to achieve. You can see more from the python docs.
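Since the question mentions parsing CSV files, here is a rough sketch of how reading without extracting might be combined with the csv module. The function name iter_csv_rows and the UTF-8 encoding are assumptions for illustration; adjust them to your data.

```python
import csv
import io
import zipfile


def iter_csv_rows(zip_path):
    """Yield (member_name, row) for every .csv member, one row at a time."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue
            # archive.open() returns a binary stream; wrap it so that
            # csv.reader receives text. The encoding is an assumption.
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                for row in csv.reader(text):
                    yield name, row
```

Each member is decompressed as it is read, so nothing is written to /tmp and only a small part of any file is held in memory at a time.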
I currently have the following csv writer class:
import csv

class csvwriter():
    writer = None
    writehandler = None

    @classmethod
    def open(cls, file):
        # 'wb' is the Python 2 idiom; on Python 3 use open(file, 'w', newline='')
        cls.writehandler = open(file, 'wb')
        cls.writer = csv.writer(cls.writehandler, delimiter=',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)

    @classmethod
    def write(cls, arr):
        cls.writer.writerow(arr)

    @classmethod
    def close(cls):
        cls.writehandler.close()
which can generate proper csv files without ever having to store the full array in memory at a single time.
However, the files created through use of this code can be quite large, so I'm looking to compress them rather than writing them uncompressed (in order to save on disk usage). I can't effectively store the file in memory either, as I'm expecting files of well over 20 GB to be a regular occurrence.
The recipients of the resulting files are generally not sysadmins of their PCs, nor do they all use Linux, so I'm constrained in the types of algorithms I'm able to use for this task. Preferably, the solution would use a compression scheme that's natively readable (no executables required) in Windows, OSX and any Linux distribution.
I've found gzip provides a very handy interface in Python, but reading gzipped files in Windows seems like quite a hassle. Ideally I'd put them in a zip archive, but zip archives don't allow you to append data to files already present in the archive, which then forces me to store the whole file in memory, or write the data away to several smaller files that I would be able to fit in memory.
My question: Is there a solution that would benefit from the best of both worlds? Widespread availability of tools to read the target format on the end-user's machine, and also the ability to append, rather than write the whole file in one go?
Thanks in advance for your consideration!
gzlog may provide the functionality you're looking for. It efficiently appends short strings to a gzip file, intended for applications where short messages are appended to a long log.
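If you would rather stay in pure Python, a related idea (not gzlog itself, just the standard gzip module opened in append mode) can be sketched as below. Each append adds a new gzip member to the end of the file, and a file made of concatenated members is still a valid gzip stream. The helper names are made up for the example, and whether a particular Windows tool reads multi-member gzip files correctly is something you would have to check.

```python
import csv
import gzip


def append_rows(path, rows):
    """Append CSV rows to a gzip file without reading what is already there."""
    # "at" = append, text mode; every call adds one more gzip member.
    with gzip.open(path, "at", encoding="utf-8", newline="") as fh:
        csv.writer(fh, quoting=csv.QUOTE_NONNUMERIC).writerows(rows)


def read_rows(path):
    """Read all rows back; the members are decompressed as one stream."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as fh:
        return list(csv.reader(fh))
```

Compression is better when each call appends a sizeable batch of rows, since every member is compressed independently. For a single long run, keeping one gzip.open(path, "wt") handle open and writing rows as they are produced avoids the issue altogether.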
I've been trying to use the built-in Python zipfile module to manipulate some .zip files on Windows. I wish to use them to store a number of files related to the current project in a program. The problem comes when I load the files from the zip and then wish to re-save them into a new, different zip file:
import os
import zipfile
zp = zipfile.ZipFile(r"first.zip",mode='r')
myfile = zp.open(r"stored_file.txt",mode='r')
### Do something, then want to save again ###
zp2 = zipfile.ZipFile(r"second.zip",mode='w')
#Doesn't work, as myfile isn't a real file:
zp2.write(myfile)
#Doesn't work, as the path can't be resolved:
zp2.write(os.path.join(zp.filename,myfile.name))
#The following works... as long as you haven't called read()
#since .seek(0) doesn't work for ZipExtFile
zp2.writestr(myfile.name,myfile.read())
I could, of course, extract the files to somewhere and then re-add them to the new zip that way, but it would be clunky and require a lot of cleanup (and creating a lot of temporary files).
Equally I could keep track of the original zip file and use the writestr method by re-opening the file, but I was hoping to avoid it. I just wondered if there was a better way around this problem; it means I'll have to have code that determines whether the file originally came from a zip or not as well and handle it differently if it did.
Edit: If anyone else has the final problem with seek(0) not working on ZipExtFile, it is possible to use an io.StringIO class to hold the result of str(myfile.read()), which is then seekable (on Python 3, where read() returns bytes, io.BytesIO(myfile.read()) plays the same role). It means I have to keep the files loaded in memory, though, so I'm going to go with keeping track of the zipfile and transferring them only when I need them.
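For anyone landing here later, one way to move a member between archives without temporary files or loading it whole might look like the sketch below. It assumes Python 3.6 or newer, where ZipFile.open accepts mode "w"; the function name copy_member is made up for the example.

```python
import shutil
import zipfile


def copy_member(src_zip, dst_zip, name):
    """Stream one member from src_zip into dst_zip without a temp file."""
    # Mode "a" creates dst_zip if it does not exist yet.
    with zipfile.ZipFile(src_zip) as src, zipfile.ZipFile(dst_zip, "a") as dst:
        with src.open(name) as reader, dst.open(name, "w") as writer:
            # copyfileobj moves the data in chunks, so the member is
            # never held in memory in full.
            shutil.copyfileobj(reader, writer)
```

For example, copy_member("first.zip", "second.zip", "stored_file.txt") would cover the case in the question. For members that may exceed 2 GiB, dst.open(name, "w", force_zip64=True) is probably needed.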
I have some bash code which moves files and directory to /tmp/rmf rather than deleting them, for safety purposes.
I am migrating the code to Python to add some functionality. One of the added features is checking the available size on /tmp and asserting that the moved directory can fit in /tmp.
Checking for available space is done using os.statvfs, but how can I measure the disk usage of the moved directory?
I could either call du using subprocess, or recursively iterate over the directory tree and sum the sizes of each file. Which approach would be better?
I think you might want to reconsider your strategy. Two reasons:
Checking whether you can move a file, asserting that you can, and then moving it builds a race condition into the operation: a big file can be created in /tmp/ after you've asserted but before you've moved your file. Doh.
Moving the file across filesystems will result in a huge amount of overhead, because every block has to be copied. This is why on OSX each volume has its own 'Trash' directory: within a volume, instead of moving the blocks that compose the file, you just create a new directory entry that points to the existing data.
I'd consider how long the file needs to be available and its visibility to consumers of the files. If it's all automated stuff happening on the backend, renaming a file to 'hide' it from computer and human consumers is easy enough in most cases and has the added benefit of being an atomic operation.
Occasionally scan the filesystem for 'old' files to cull and rm them after some grace period. No drama. Also makes restoring files a lot easier since it's just a rename to restore.
This should do the trick:
import os
path = 'THE PATH OF THE DIRECTORY YOU WANT TO FETCH'
os.statvfs(path)
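To cover the other half of the question (measuring the directory itself), a minimal sketch of the "walk the tree and sum the sizes" approach follows. The function names are made up for the example, os.statvfs is Unix-only, and st_size is the apparent file size, so the total can differ from what du reports in disk blocks.

```python
import os


def directory_size(path):
    """Sum the apparent sizes (st_size) of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):  # skip symlinks
                total += os.lstat(full).st_size
    return total


def free_bytes(path):
    """Bytes available to unprivileged users on the filesystem holding path."""
    stats = os.statvfs(path)  # Unix only
    return stats.f_bavail * stats.f_frsize


def fits(src_dir, dst_dir):
    """True if src_dir's files would currently fit on dst_dir's filesystem."""
    return directory_size(src_dir) <= free_bytes(dst_dir)
```

Keep in mind that the result is only a snapshot: as noted in the other answer, free space can change between the check and the move.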