I ran some Python code on a supercomputer that created a large number of .csv files, using mpi4py to manage the parallel processing. Several processes ran on each node, each performing a number of operations, and each process created a .csv file. I didn't want to drop all the .csv files onto the main hard disk because there were so many, so I wrote them to local SSDs instead, one SSD attached to each node. Then I wanted to zip all the .csvs from each SSD into one zip archive per node and put the archive on the main hard drive. I did it like this:
from mpi4py import MPI
import os
import platform
import zipfile

# MPI setup
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def get_localrank(comm):
    comm_localrank = MPI.Comm.Split_type(comm, MPI.COMM_TYPE_SHARED, 0)
    return comm_localrank.Get_rank()

def ls_csv(path, ext=".csv"):
    all_files = os.listdir(path)
    return [path + "/" + f for f in all_files if f.endswith(ext)]

def ssd_to_drive(comm, drive_path):
    # only execute on one MPI process per node
    if get_localrank(comm) == 0:
        # every node's file needs a different name
        archive_name = drive_path + "/data_" + platform.node() + ".zip"
        csvs = ls_csv("ssd_path")
        zf = zipfile.ZipFile(archive_name, 'w', zipfile.ZIP_DEFLATED)
        for csvf in csvs:
            zf.write(csvf)
        zf.close()

## copy archived csvs from the ssds back to project
ssd_to_drive(comm=comm, drive_path="mypath")
I get the zip files back in the directory I want them, but I can't open them. Either they are corrupt or unzip thinks they are part of a multi-part archive, and if that's the case, I don't see how to reconstruct them.
When I do less data_nodename.zip, I get the following (where the alphanumeric string before each error statement represents "nodename"):
e33n08
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
f09n09
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
f09n10
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
e33n06
Zipfile is disk 37605 of a multi-disk archive, and this is not the disk on
which the central zipfile directory begins (disk 25182).
f09n11
Zipfile is disk 40457 of a multi-disk archive, and this is not the disk on
which the central zipfile directory begins (disk 740).
a06n10
end-of-central-directory record claims this
is disk 11604 but that the central directory starts on disk 48929; this is a
contradiction. Attempting to process anyway.
f10n14
end-of-central-directory record claims this
is disk 15085 but that the central directory starts on disk 52010; this is a
contradiction.
Of note, I only have 3 zip files that seem to claim to be the beginning of the multi-disk archive (e33n08, f09n09, f09n10), but at least 4 references to a "beginning of central directory" disk (25182, 740, 48929, 52010).
So now I can't figure out if these are corrupt or if zipfile really thought it was creating a multi-disk archive, why this happened, or how to reconstruct the archive if it really is multi-disk.
Finally, when I did this procedure with multiple mpi tasks but only one node, the single zip archive created was fine and readable with less and unzip.
Can't believe I missed this, but the files were indeed corrupt: the job had exited before writing of the zip archives was complete.
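A minimal sketch of one way to guard against this, assuming the same helpers as above ("ssd_path" and "mypath" remain placeholders): write each archive through a with block so it is finalized and closed, and call comm.Barrier() so no rank can reach the end of the script before every node has finished writing.
def ssd_to_drive(comm, ssd_path, drive_path):
    # only one MPI process per node writes this node's archive
    if get_localrank(comm) == 0:
        archive_name = drive_path + "/data_" + platform.node() + ".zip"
        # the context manager guarantees the central directory is written on close
        with zipfile.ZipFile(archive_name, 'w', zipfile.ZIP_DEFLATED) as zf:
            for csvf in ls_csv(ssd_path):
                zf.write(csvf)
    # every rank waits here until all node archives are complete
    comm.Barrier()

ssd_to_drive(comm=comm, ssd_path="ssd_path", drive_path="mypath")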
Related
dir_ = "/path/to/folder/with/huge/number/of/files"
subdirs = [os.path.join(dir_, file) for file in os.listdir(dir_)]
# one of the subdirs contains a huge number of files
files = [os.path.join(file, f) for file in subdirs for f in os.listdir(file)]
The code ran smoothly the first few times, taking under 30 seconds, but over successive runs of the same code the time increased to 11 minutes, and now it doesn't even finish in 11 minutes. The problem is in the third line, and I suspect os.listdir is responsible.
EDIT: I just want to read the file paths so they can be passed as arguments to a multiprocessing function. RAM is also not an issue; RAM is ample and the program uses less than a tenth of it.
The likely cause is that os.listdir(dir_) builds the complete list of entries in a directory in memory before returning, which can take a long time if the directory is very large or the system is under heavy load.
Instead, use the os.scandir() approach below, or os.walk() (a sketch of the os.walk() variant follows the example).
import os

dir_ = "/path/to/folder/with/huge/number/of/files"
subdirs = [os.path.join(dir_, file) for file in os.listdir(dir_)]

# Create an empty list to store the file paths
files = []
for subdir in subdirs:
    # Use os.scandir() to iterate over the files and directories in the subdirectory
    with os.scandir(subdir) as entries:
        for entry in entries:
            # Check if the entry is a regular file
            if entry.is_file():
                # Add the file path to the list
                files.append(entry.path)
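For comparison, a sketch of the os.walk() alternative mentioned above, using the same placeholder path. Note that os.walk() recurses through the whole tree rather than stopping one level down, which may or may not be what you want:
import os

dir_ = "/path/to/folder/with/huge/number/of/files"

files = []
# dirpath is the directory currently being visited; filenames are the
# regular files directly inside it, yielded one directory at a time
for dirpath, dirnames, filenames in os.walk(dir_):
    for name in filenames:
        files.append(os.path.join(dirpath, name))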
I want to tar a 49 GB folder so it's compressed, but not persisted afterwards.
I'm using the below code to create a temporary directory (I chose the parent path) and then shutil.make_archive:
import tempfile
import shutil

with tempfile.TemporaryDirectory(dir=temp_dir_path) as temp_output_dir:
    output_path = shutil.make_archive(temp_output_dir, "tar", dir_to_tar)
However, the size of the tar file generated is larger than when I du -sh the original folder. I stopped the code when it reached 51 GB.
Am I using this wrong?
The tar (tape archive) format doesn't compress, it just "archives" files together (it's basically just a directory structure and the files it contains shoved into a single file). Definitionally, it's a little larger than the original data.
If you want the tarball to be compressed so it ends up smaller than the original data, use one of the compressed format arguments to make_archive, e.g. 'gztar' (faster, worse compression, most portable) or 'xztar' (slower, best compression, somewhat less portable) instead of just 'tar'.
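For example, a minimal sketch of the same call with the 'gztar' format (temp_dir_path and dir_to_tar are the same placeholders as in the question):
import tempfile
import shutil

with tempfile.TemporaryDirectory(dir=temp_dir_path) as temp_output_dir:
    # produces <temp_output_dir>.tar.gz and returns its full path
    output_path = shutil.make_archive(temp_output_dir, "gztar", dir_to_tar)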
I have a big zip file containing many files that I'd like to unzip in chunks to avoid consuming too much memory.
I tried to use the Python module zipfile, but I didn't find a way to load the archive in chunks and extract it to disk.
Is there a simple way to do that in Python?
EDIT
#steven-rumbalski correctly pointed out that zipfile handles big files correctly, unzipping them one by one without loading the full archive into memory.
My problem here is that my zip file sits on AWS S3 and my EC2 instance cannot load such a big file into RAM, so I download it in chunks and would like to unzip it in chunks as well.
You don't need a special way to extract a large archive to disk. The source Lib/zipfile.py shows that zipfile is already memory efficient. Creating a zipfile.ZipFile object does not read the whole file into memory. Rather it just reads in the table of contents for the ZIP file. ZipFile.extractall() extracts files one at a time using shutil.copyfileobj() copying from a subclass of io.BufferedIOBase.
If all you want to do is a one-time extraction Python provides a shortcut from the command line:
python -m zipfile -e archive.zip target-dir/
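To illustrate why no special handling is needed, here is a small sketch (the archive and member names are placeholders, not from the question): ZipFile.open() returns a file-like object for a single member, so each member can be streamed to disk in buffered chunks without ever being decompressed fully into memory:
import shutil
import zipfile

# "archive.zip" and "some_member.csv" are placeholder names
with zipfile.ZipFile("archive.zip") as zf:      # reads only the table of contents
    with zf.open("some_member.csv") as src, open("some_member.csv", "wb") as dst:
        shutil.copyfileobj(src, dst)            # copies in buffered chunks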
You can use zipfile (or possibly tarfile) as follows:
import zipfile

def extract_chunk(fn, directory, ix_begin, ix_end):
    # extract only the members with indices in [ix_begin, ix_end) into directory
    with zipfile.ZipFile(fn, 'r') as zf:
        infos = zf.infolist()
        for ix in range(max(0, ix_begin), min(ix_end, len(infos))):
            zf.extract(infos[ix], directory)

directory = "path"
extract_chunk("{}/file.zip".format(directory), directory, 0, 50)
I am trying to use py7zlib to open and read files stored in .7z archives. I am able to do this, but it appears to be causing a memory leak. After scanning through a few hundred .7z files using py7zlib, Python crashes with a MemoryError. I don't have this problem when doing the equivalent operations on .zip files using the built-in zipfile library. My process with the .7z files is essentially as follows (look for a subfile in the archive with a given name and return its contents):
with open(filename, 'rb') as f:
    z = py7zlib.Archive7z(f)
    names = z.getnames()
    if subName in names:
        subFile = z.getmember(subName)
        contents = subFile.read()
    else:
        contents = None
return contents
Does anyone know why this would be causing a memory leak once the Archive7z object passes out of scope if I am closing the .7z file object? Is there any kind of cleanup or file-closing procedure I need to follow (like with the zipfile library's ZipFile.close())?
Using IronPython 2.6 (I'm new), I'm trying to write a program that opens a file, saves it at a series of locations, and then opens/manipulates/re-saves those. It will be run by an upper-level program on a loop, and this entire procedure is designed to catch/preserve corrupted saves so my company can figure out why this glitch of corruption occasionally happens.
I've currently worked out the Open/Save to locations parts of the script and now I need to build a function that opens, checks for corruption, and (if corrupted) moves the file into a subfolder (with an iterative renaming applied, for copies) or (if okay), modifies the file and saves a duplicate, where the process is repeated on the duplicate, sans duplication.
I tell this all for context to the root problem. In my situation, what is the most pythonic, consistent, and windows/unix friendly way to move a file (corrupted) into a subfolder while also renaming it based on the number of pre-existing copies of the file that exist within said subfolder?
In other words:
In a folder structure built as:
C:\Folder\test.txt
C:\Folder\Subfolder
C:\Folder\Subfolder\test.txt
C:\Folder\Subfolder\test01.txt
C:\Folder\Subfolder\test02.txt
C:\Folder\Subfolder\test03.txt
How do I move test.txt so that:
C:\Folder\Subfolder
C:\Folder\Subfolder\test.txt
C:\Folder\Subfolder\test01.txt
C:\Folder\Subfolder\test02.txt
C:\Folder\Subfolder\test03.txt
C:\Folder\Subfolder\test04.txt
In an automated way, so that I can loop my program overnight and have it stack up the corrupted text files I need to save? Note: they're not text files in practice, that's just an example.
Assuming you are going to use the convention of incrementally suffixing numbers to the files:
import os.path
import shutil

def store_copy(file_to_copy, destination):
    # split e.g. "test.txt" into "test" and ".txt"
    filename, extension = os.path.splitext(os.path.basename(file_to_copy))
    # count how many entries in the destination already start with the base name
    existing_files = [i for i in os.listdir(destination) if i.startswith(filename)]
    new_file_name = "%s%02d%s" % (filename, len(existing_files), extension)
    shutil.copy2(file_to_copy, os.path.join(destination, new_file_name))
There's a fail case if you have subdirectories or files in the destination whose names overlap with the source file: for example, if your file is named 'example.txt' and the destination contains 'example_A.txt' as well as 'example.txt' and 'example01.txt'. If that's a possibility, you'd have to change the test in the existing_files line to something more sophisticated.
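For instance, with the folder layout from the question (the paths below are purely illustrative), the call would look like this; with four matching entries already in the subfolder, it creates test04.txt. Since store_copy copies rather than moves, the original could be deleted afterwards with os.remove if a true move is needed:
# copy C:\Folder\test.txt into the subfolder as the next numbered copy
store_copy(r"C:\Folder\test.txt", r"C:\Folder\Subfolder")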