Archive files directly from memory in Python


I'm writing a program that takes a number of files and zips each of them with encryption using pyzipper, writing the result to an io.BytesIO() object so everything stays in memory. Now, after some other additions, I want to take all of these in-memory files and zip them together into a single encrypted zip file, again using pyzipper.
The code looks something like this:
# Create the in-memory file object
in_memory = BytesIO()
# Create the zip file and open in write mode
with pyzipper.AESZipFile(in_memory, "w", compression=pyzipper.ZIP_LZMA, encryption=pyzipper.WZ_AES) as zip_file:
    # Set password
    zip_file.setpassword(b"password")
    # Save "data" with file_name
    zip_file.writestr(file_name, data)
# Go to the beginning
in_memory.seek(0)
# Read the zip file data
data = in_memory.read()
# Add the data to a list
files.append(data)
So, as you may guess the "files" list is an attribute from a class and the whole thing above is a function that does this a number of times and then you get the full files list. For simplicity's sake, I removed most of the irrelevant parts.
I get no errors for now, but when I try to write all files to a new zip file I get an error. Here's the code:
with pyzipper.AESZipFile(test_name, "w", compression=pyzipper.ZIP_LZMA, encryption=pyzipper.WZ_AES) as zfile:
    zfile.setpassword(b"pass")
    for file in files:
        zfile.write(file)
I get a ValueError because of os.stat:
File "C:\Users\vulka\AppData\Local\Programs\Python\Python310\lib\site-packages\pyzipper\zipfile.py", line 820, in from_file
st = os.stat(filename)
ValueError: stat: embedded null character in path
[WHAT I TRIED]
So, I tried using mmap for this purpose but I don't think this can help me and if it can - then I have no idea how to make it work.
I also tried using fs.memoryfs.MemoryFS to temporarily create a virtual filesystem in memory to store all the files, then get them back, zip everything together and save it to disk. Again - failed. I got tons of different errors in my tests and TBH, there's very little information out there on this fs method, and even if what I'm trying to do is possible - I couldn't figure it out.
P.S: I don't know if pyzipper (almost 1:1 zipfile with the addition of encryption) supports nested zip files at all. This could be the problem I'm facing, but if it doesn't, I'm open to any suggestions for a new approach to doing this. Also, I don't want to rely on 3rd party software, even if it is open source! (I'm talking about the method of using 7zip to do all the archiving and encryption, even though it shouldn't even be possible to use it without saving the files to disk in the first place, which is the main thing I'm trying to avoid.)
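The traceback points at the likely cause: write() expects a path on disk and calls os.stat() on it, so passing raw zip bytes fails. writestr() takes an archive member name plus the data, so the inner archives can be nested without touching the filesystem. Below is a minimal sketch using the standard-library zipfile so it runs without extra packages. The helper name zip_bytes and the member names are made up for illustration, and it is an assumption (based on pyzipper mirroring the zipfile API, as noted in the question) that swapping in pyzipper.AESZipFile with setpassword() behaves the same way.

```python
import io
import zipfile


def zip_bytes(members):
    """Build a zip archive in memory from {member_name: bytes}; return raw bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, payload in members.items():
            # writestr() stores bytes under a name; no file on disk is needed.
            archive.writestr(name, payload)
    return buffer.getvalue()


# Each inner archive is just bytes, so it can become a member of the outer one.
inner_archives = {
    "first.zip": zip_bytes({"a.txt": b"alpha"}),
    "second.zip": zip_bytes({"b.txt": b"beta"}),
}
outer = zip_bytes(inner_archives)
```

Note that each entry needs a name inside the outer archive, so the files list would have to hold (name, data) pairs rather than bare bytes.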

Related

Duplicate in list created from filenames (python)

I'm trying to create a list of Excel files that are saved to a specific directory, but when the list is generated it contains what looks like a duplicate entry for one of the file names (I am absolutely certain there is not actually a duplicate of the file).
import glob
# get data file names
path =r'D:\larvalSchooling\data'
filenames = glob.glob(path + "/*.xlsx")
output:
>>> filenames
['D:\\larvalSchooling\\data\\copy.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Sat_70dpf_GroupA_n5_20200808_1015-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Sat_84dpf_GroupABCD_n5_20200822_1440-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx']
You'll note 'D:\larvalSchooling\data\Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx' is listed twice.
Rather than going through after the fact and removing duplicates, I was hoping to figure out why it's happening to begin with.
I'm using Python 3.7 on Windows 10 Pro.

If you wrote the code to remove duplicates (which can be as simple as filenames = set(filenames)) you'd see that you still have two filenames. Print them out one on top of the other to make a visual comparison easier:
'D:\\larvalSchooling\\data\\Raw data-SF_Sat_84dpf_GroupABCD_n5_20200822_1440-Trial 1.xlsx',
'D:\\larvalSchooling\\data\\~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx'
The second one has a leading ~$ (probably an auto-backup).

Whenever you open an Excel file, Excel creates a hidden temporary file alongside it, with the same name prefixed by ~$. In this case:
Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx
~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx
This means the file is open in some program, and glob is picking up that temporary file (it is usually hidden in Explorer as well).
Just find the program and close it. If this is something you want to avoid in general, you should also add validation so that files matching "~$*.xlsx" are ignored.
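The validation mentioned above could be sketched like this while keeping the glob call from the question. The function name is hypothetical, and ntpath is used only so the Windows-style paths are split the same way on any platform.

```python
import ntpath


def skip_excel_temp_files(paths):
    """Return only the paths whose file name does not start with '~$'."""
    return [p for p in paths if not ntpath.basename(p).startswith("~$")]


found = [
    r"D:\larvalSchooling\data\copy.xlsx",
    r"D:\larvalSchooling\data\~$Raw data.xlsx",
]
filenames = skip_excel_temp_files(found)
```

With the question's code this would be called as skip_excel_temp_files(glob.glob(path + "/*.xlsx")).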

You can use os.path.splitext to get the file extension and loop through the directory using os.listdir. The open Excel files can be skipped using the following code:
import os

filenames = []
# Raw string so the backslashes in the Windows path are not treated as escapes
for file in os.listdir(r'D:\larvalSchooling\data'):
    filename, file_extension = os.path.splitext(file)
    if file_extension == '.xlsx':
        if not file.startswith('~$'):
            filenames.append(file)
Note: this might not be the best solution, but it'll get the job done :)

Run python zip file from memory at runtime?

I am trying to run a python zip file which is retrieved using requests.get. The zip file has several directories of python files in addition to the __main__.py, so in the interest of easily sending it as a single file, I am zipping it.
I know the file is being sent correctly, as I can save it to a file and then run it, however, I want to execute it without writing it to storage.
The working part is more or less as follows:
import requests
response = requests.get("http://myurl.com/get_zip")
I can write the zip to file using
f = open("myapp.zip","wb")
f.write(response.content)
f.close()
and manually run it from command line. However, I want something more like
exec(response.content)
This doesn't work since it's still compressed, but you get the idea.
I am also open to ideas that replace the zip with some other format of sending the code over internet, if it's easier to execute it from memory.

A possible solution is this
import io
import requests
from zipfile import ZipFile
response = requests.get("http://myurl.com/get_zip")
# Read the contents of the zip into a bytes object.
binary_zip = io.BytesIO(response.content)
# Convert the bytes object into a ZipFile.
zip_file = ZipFile(binary_zip, "r")
# Iterate over all files in the zip (folders should be also ok).
for script_file in zip_file.namelist():
    exec(zip_file.read(script_file))
But it is a bit convoluted and probably can be improved.
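One limitation of exec-ing every member in turn is that imports between the zipped modules will not resolve, and the standard zipimport machinery expects a path on disk. A possible refinement is a small custom importer that serves modules from the zip bytes and then runs __main__.py. This is only a sketch: the class and function names are hypothetical, the demo archive stands in for response.content, and it handles plain .py files and packages only.

```python
import importlib.abc
import importlib.util
import io
import sys
import zipfile


class InMemoryZipImporter(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Hypothetical helper: import .py modules from zip bytes held in memory."""

    def __init__(self, zip_bytes):
        self._zip = zipfile.ZipFile(io.BytesIO(zip_bytes))
        self._names = set(self._zip.namelist())

    def read(self, member):
        return self._zip.read(member)

    def _locate(self, fullname):
        base = fullname.replace(".", "/")
        if base + "/__init__.py" in self._names:
            return base + "/__init__.py", True   # a package
        if base + ".py" in self._names:
            return base + ".py", False           # a plain module
        return None, False

    def find_spec(self, fullname, path=None, target=None):
        member, is_package = self._locate(fullname)
        if member is None:
            return None  # not in the zip; let the other finders try
        return importlib.util.spec_from_loader(fullname, self, is_package=is_package)

    def create_module(self, spec):
        return None  # fall back to the default module object

    def exec_module(self, module):
        member, _ = self._locate(module.__name__)
        exec(compile(self.read(member), member, "exec"), module.__dict__)


def run_zip_from_memory(zip_bytes):
    """Run __main__.py from the zip bytes and return its global namespace."""
    importer = InMemoryZipImporter(zip_bytes)
    sys.meta_path.insert(0, importer)  # zip contents take priority while running
    try:
        namespace = {"__name__": "__main__"}
        exec(compile(importer.read("__main__.py"), "__main__.py", "exec"), namespace)
        return namespace
    finally:
        sys.meta_path.remove(importer)


# Stand-in for response.content: a tiny app zipped in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("memzip_demo/__init__.py", "")
    archive.writestr("memzip_demo/helper.py", "def greet():\n    return 'hello from memory'\n")
    archive.writestr("__main__.py", "from memzip_demo.helper import greet\nRESULT = greet()\n")

namespace = run_zip_from_memory(buffer.getvalue())
```

Because the importer is removed once __main__.py finishes, any import that the downloaded code defers until later would no longer find the zip; keeping the importer registered is the alternative if that matters.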

WinError 32 :The process cannot access the file because it is being used by another process

I have written the following code to extract zip files in a directory and delete a particular Excel file in the extracted directory:
def extractZipFiles(dest_directory):
    "This function extracts zip files in the destination directory for further processing"
    fileFullPath = dest_directory + '\\'
    extractedDirList = list()
    for file in os.listdir(dest_directory):
        dn = fileFullPath+file
        dn = re.sub(r'\.zip$', "", fileFullPath+file) #remove the trailing .zip.
        extractedDirList.append(dn)
        zf = zipfile.ZipFile(fileFullPath+file, mode='r')
        zf.extractall(dn) # extract the contents of that zip to the empty directory
        zf.close()
    return extractedDirList

def removeSelectedReports(extractedDirList):
    "This function removes the selected reports from extracted directory"
    for i in range(len(extractedDirList)):
        for filename in os.listdir(extractedDirList[i]):
            if filename.startswith("ABC_8"):
                logger.info("File to be removed::"+filename)
                fullPathName= "%s/%s" % (extractedDirList[i],filename)
                os.remove(fullPathName)
    return

extractedDirList = extractZipFiles(attributionRptDestDir)
logger.info("ZIP FILES EXTRACTED:"+str(extractedDirList))
removeSelectedReports(extractedDirList)
I am getting the following intermittent issue even though I have closed the zip file handler.
[WinError 32] The process cannot access the file because it is being used by another process: '\\\\share\\Workingdirectory\\report.20180517.zip'
Can you please help resolve this issue

You should try to figure out what has the file open. Based on your code, it looks like you are on Microsoft Windows.
I would stop all applications on your workstation, including browsers, run with only a minimum number of apps open, and reproduce the problem. Once reproduced, you can use a tool that lists all handles open to a particular file.
A handy utility would be handle.exe, but please use any tool with similar functionality.
Once you find the offending application, you can further investigate why the file is open, and take counter measures.
I would be careful not to close any application which has the file open, until you know it is safe to do so.
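While the process holding the handle is being tracked down, a retry wrapper can help confirm whether the lock is only transient. This is a sketch, not a fix for the root cause: the function name, attempt count and delay are arbitrary choices, and it relies on WinError 32 surfacing as PermissionError in Python 3.

```python
import time


def retry_on_permission_error(action, attempts=5, delay=0.5):
    """Call action(); on PermissionError wait and retry, re-raising on the last try."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except PermissionError:
            if attempt == attempts:
                raise
            time.sleep(delay)
```

In the question's code this might be used as retry_on_permission_error(lambda: os.remove(fullPathName)).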

Pythonic way to smart-rename files for record-keeping sake?

Using IronPython 2.6 (I'm new), I'm trying to write a program that opens a file, saves it at a series of locations, and then opens/manipulates/re-saves those. It will be run by an upper-level program on a loop, and this entire procedure is designed to catch/preserve corrupted saves so my company can figure out why this glitch of corruption occasionally happens.
I've currently worked out the Open/Save to locations parts of the script and now I need to build a function that opens, checks for corruption, and (if corrupted) moves the file into a subfolder (with an iterative renaming applied, for copies) or (if okay), modifies the file and saves a duplicate, where the process is repeated on the duplicate, sans duplication.
I tell this all for context to the root problem. In my situation, what is the most pythonic, consistent, and windows/unix friendly way to move a file (corrupted) into a subfolder while also renaming it based on the number of pre-existing copies of the file that exist within said subfolder?
In other words:
In a folder structure built as:
C:\Folder\test.txt
C:\Folder\Subfolder
C:\Folder\Subfolder\test.txt
C:\Folder\Subfolder\test01.txt
C:\Folder\Subfolder\test02.txt
C:\Folder\Subfolder\test03.txt
How do I move test.txt such that:
C:\Folder\Subfolder
C:\Folder\Subfolder\test.txt
C:\Folder\Subfolder\test01.txt
C:\Folder\Subfolder\test02.txt
C:\Folder\Subfolder\test03.txt
C:\Folder\Subfolder\test04.txt
In an automated way, so that I can loop my program overnight and have it stack up the corrupted text files I need to save? Note: They're not text files in practice, just example.

Assuming you are going to use the convention of incrementally suffixing numbers to the files:
import os.path
import shutil

def store_copy(file_to_copy, destination):
    # Split "test.txt" into "test" and ".txt"
    filename, extension = os.path.splitext(os.path.basename(file_to_copy))
    # Count the entries in destination that already start with the same name
    existing_files = [i for i in os.listdir(destination) if i.startswith(filename)]
    new_file_name = "%s%02d%s" % (filename, len(existing_files), extension)
    shutil.copy2(file_to_copy, os.path.join(destination, new_file_name))
There's a failure case if you have subdirectories or files in destination whose names overlap with the source file, i.e. if your file is named 'example.txt' and the destination contains 'example_A.txt' as well as 'example.txt' and 'example01.txt'. If that's a possibility, you'd have to change the test in the existing_files = line to something more sophisticated.
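One way to make that test more sophisticated is to match the exact name-plus-digits pattern with a regular expression and use the highest existing number instead of a count. The sketch below also moves the file rather than copying it, as the question asks; the function names are hypothetical and the two-digit suffix follows the question's example.

```python
import os
import re
import shutil


def next_copy_name(filename, existing_names):
    """Pick the next free name: test.txt, then test01.txt, test02.txt, ..."""
    stem, ext = os.path.splitext(filename)
    # Only names of the form <stem><digits><ext> count; 'test_A.txt' does not.
    pattern = re.compile(r"^%s(\d*)%s$" % (re.escape(stem), re.escape(ext)))
    numbers = []
    for name in existing_names:
        match = pattern.match(name)
        if match:
            numbers.append(int(match.group(1) or 0))
    if not numbers:
        return filename
    return "%s%02d%s" % (stem, max(numbers) + 1, ext)


def move_with_suffix(source_path, destination_dir):
    """Move source_path into destination_dir under the next free name."""
    new_name = next_copy_name(os.path.basename(source_path), os.listdir(destination_dir))
    target = os.path.join(destination_dir, new_name)
    shutil.move(source_path, target)
    return target
```

A name such as 'test2020.txt' would still be read as copy number 2020, so a stricter suffix format (for example exactly two digits) may be preferable if such names can occur.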

How to safely write to a file?

Imagine you have a library for working with some sort of XML file or configuration file. The library reads the whole file into memory and provides methods for editing the content. When you are done manipulating the content you can call a write to save the content back to file. The question is how to do this in a safe way.
Overwriting the existing file (starting to write to the original file) is obviously not safe. If the write method fails before it is done you end up with a half written file and you have lost data.
A better option would be to write to a temporary file somewhere, and when the write method has finished, you copy the temporary file to the original file.
Now, if the copy somehow fails, you still have correctly saved data in the temporary file. And if the copy succeeds, you can remove the temporary file.
On POSIX systems I guess you can use the rename system call which is an atomic operation. But how would you do this best on a Windows system? In particular, how do you handle this best using Python?
Also, is there another scheme for safely writing to files?

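Python's documentation states that os.rename() is an atomic operation when it succeeds (this is a POSIX requirement). So in your case, writing the data to a temporary file and then renaming it over the original file would be quite safe. Note, however, that on Windows os.rename() raises an error if the destination already exists; os.replace() (Python 3.3+) replaces an existing destination file on both platforms.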
Another way could work like this:
let original file be abc.xml
create abc.xml.tmp and write new data to it
rename abc.xml to abc.xml.bak
rename abc.xml.tmp to abc.xml
after new abc.xml is properly put in place, remove abc.xml.bak
As you can see, you keep abc.xml.bak around, which you can use to restore the original if anything goes wrong with the tmp file or with moving it into place.
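The rename sequence above might look like this in Python. It is a sketch under a few assumptions: the function name is made up, the .tmp and .bak suffixes follow the list above, and os.replace() (Python 3.3+) is used so the renames also work on Windows when a stale file is in the way.

```python
import os


def save_with_backup(path, data):
    """Write data to path via path.tmp, keeping path.bak until the swap is done."""
    tmp_path = path + ".tmp"
    bak_path = path + ".bak"
    with open(tmp_path, "wb") as tmp:
        tmp.write(data)
    had_original = os.path.exists(path)
    if had_original:
        os.replace(path, bak_path)      # abc.xml -> abc.xml.bak
    try:
        os.replace(tmp_path, path)      # abc.xml.tmp -> abc.xml
    except OSError:
        if had_original:
            os.replace(bak_path, path)  # put the original back
        raise
    if had_original:
        os.remove(bak_path)             # new file is in place; drop the backup
```

If the process dies between the two renames, abc.xml is briefly missing, but abc.xml.bak and abc.xml.tmp are both still on disk for recovery.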

If you want to be POSIXly correct and safe you have to:
Write to temporary file
Flush and fsync the file (or fdatasync)
Rename over the original file
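The three steps above can be sketched as follows. The function name is hypothetical; the temporary file is created in the target's own directory so the rename stays on one filesystem, and os.replace() is used instead of os.rename() so that an existing target is also overwritten on Windows.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write bytes to a temp file, flush and fsync it, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()                 # push Python's buffer to the OS
            os.fsync(tmp.fileno())      # ask the OS to push it to the disk
        os.replace(tmp_path, path)      # rename over the original
    except BaseException:
        # Leave the original untouched and clean up the partial temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

This sketch does not fsync the containing directory, which some setups also do so that the rename itself is durable.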
Note that calling fsync has unpredictable effects on performance -- Linux on ext3 may stall on disk I/O for several seconds as a result, depending on other outstanding I/O.
Notice that rename is not an atomic operation in POSIX -- at least not in relation to file data as you expect. However, most operating systems and filesystems will work this way. But it seems you missed the very large Linux discussion about Ext4 and filesystem guarantees about atomicity. I don't know exactly where to link but here is a start: ext4 and data loss.
Notice however that on many systems, rename will be as safe in practice as you expect. However, it is in a way not possible to get both -- performance and reliability across all possible Linux configurations!
With a write to a temporary file, then a rename of the temporary file, one would expect the operations are dependent and would be executed in order.
The issue however is that most, if not all, filesystems separate metadata and data. A rename is only metadata. It may sound horrible to you, but filesystems value metadata over data (take journaling in HFS+ or Ext3/Ext4 for example)! The reason is that metadata is lighter, and if the metadata is corrupt, the whole filesystem is corrupt -- the filesystem must of course preserve itself, then preserve the user's data, in that order.
Ext4 did break the rename expectation when it first came out, however heuristics were added to resolve it. The issue is not a failed rename, but a successful rename. Ext4 might successfully register the rename, but fail to write out the file data if a crash comes shortly thereafter. The result is then a 0-length file and neither original nor new data.
So in short, POSIX makes no such guarantee. Read the linked Ext4 article for more information!

In the Win API I found a quite nice function, ReplaceFile, that does what the name suggests, even with an optional backup. There is always the option of a DeleteFile, MoveFile combo.
In general what you want to do is really good. And I cannot think of any better write scheme.

A simplistic solution: use tempfile to create a temporary file and, if writing succeeds, just rename the file to your original configuration file.
Note that rename is not atomic across filesystems. You'll have to resort to a slight workaround (e.g. tempfile on target filesystem, followed by a rename) in order to be really safe.
For locking a file, see portalocker.

The standard solution is this.
Write a new file with a similar name. X.ext# for example.
When that file has been closed (and perhaps even read and checksummed), then you do two renames.
X.ext (the original) to X.ext~
X.ext# (the new one) to X.ext
(Only for the crazy paranoids) call the OS sync function to force dirty buffer writes.
At no time is anything lost or corruptible. The only glitch can happen during the renames. But you haven't lost anything or corrupted anything. The original is recoverable right up until the final rename.

Per RedGlyph's suggestion, I've added an implementation of ReplaceFile that uses ctypes to access the Windows APIs. I first added this to jaraco.windows.api.filesystem.
from ctypes import windll
from ctypes.wintypes import BOOL, DWORD, LPVOID, LPWSTR

ReplaceFile = windll.kernel32.ReplaceFileW
ReplaceFile.restype = BOOL
ReplaceFile.argtypes = [
    LPWSTR,
    LPWSTR,
    LPWSTR,
    DWORD,
    LPVOID,
    LPVOID,
]
REPLACEFILE_WRITE_THROUGH = 0x1
REPLACEFILE_IGNORE_MERGE_ERRORS = 0x2
REPLACEFILE_IGNORE_ACL_ERRORS = 0x4
I then tested the behavior using this script.
from jaraco.windows.api.filesystem import ReplaceFile
import os
open('orig-file', 'w').write('some content')
open('replacing-file', 'w').write('new content')
ReplaceFile('orig-file', 'replacing-file', 'orig-backup', 0, 0, 0)
assert open('orig-file').read() == 'new content'
assert open('orig-backup').read() == 'some content'
assert not os.path.exists('replacing-file')
While this only works in Windows, it appears to have a lot of nice features that other replace routines would lack. See the API docs for details.

There's now a codified, pure-Python, and I dare say Pythonic solution to this in the boltons utility library: boltons.fileutils.atomic_save.
Just pip install boltons, then:
from boltons.fileutils import atomic_save
with atomic_save('/path/to/file.txt') as f:
    f.write('this will only overwrite if it succeeds!\n')
There are a lot of practical options, all well-documented. Full disclosure, I am the author of boltons, but this particular part was built with a lot of community help. Don't hesitate to drop a note if something is unclear!

You could use the fileinput module to handle the backing-up and in-place writing for you:
import fileinput
for line in fileinput.input(filename,inplace=True, backup='.bak'):
    # inplace=True causes the original file to be moved to a backup
    # standard output is redirected to the original file.
    # backup='.bak' specifies the extension for the backup file.
    # manipulate line
    newline=process(line)
    print(newline)
If you need to read in the entire contents before you can write the new lines, then you can do that first, then print the entire new contents with
newcontents=process(contents)
for line in fileinput.input(filename,inplace=True, backup='.bak'):
    print(newcontents)
    break
If the script ends abruptly, you will still have the backup.
