Best way to replace files in place safely and efficiently? - python

I am trying to encrypt a file in place using the cryptography module, so that I don't have to buffer the whole ciphertext in memory (which can be memory-intensive) and then replace the original file with it. My approach is to encrypt a chunk of plaintext and overwrite it with its ciphertext, 16 bytes at a time (AES-CTR mode). The problem is that the loop below never terminates.
So how do I fix this?
What other methods do you suggest?
What are the side effects of using a method like the one below?
pointer = 0
with open(path, "r+b") as file:
    print("...ENCRYPTING")
    while file:
        file_data = file.read(16)
        pointer += 16
        ciphertext = aes_enc.update(file_data)
        file.seek(pointer - 16)
        file.write(ciphertext)
    print("...Complete...")

So how do I fix this?
As Cyril Jouve already mentions, check for if not file_data.
What other methods do you suggest? What are the side effects of using such a method?
Reading in blocks of 16 bytes is relatively slow. You probably have enough memory to read larger blocks, like 4096 or 8192 bytes.
Unless you have very large files and limited disk space, I think there is no benefit in reading and writing in the same file. If an error occurs after the OS has already written data to disk, you will have lost the original data and will be left with an incomplete encrypted file, without knowing which part of it is encrypted.
It's easier and safer to create a new encrypted file and then delete and rename only if there were no errors.
Encrypt to a new file, catch exceptions, check the existence and size of the encrypted file, and delete the source and rename the encrypted file only if all is OK.
import os

path = r'D:\test.dat'
input_path = path
encrypt_path = path + '_encrypt'

try:
    with open(input_path, "rb") as input_file:
        with open(encrypt_path, "wb") as encrypt_file:
            print("...ENCRYPTING")
            while True:
                file_data = input_file.read(4096)
                if not file_data:
                    break
                ciphertext = aes_enc.update(file_data)
                encrypt_file.write(ciphertext)
            print("...Complete...")

    if os.path.exists(encrypt_path):
        if os.path.getsize(input_path) == os.path.getsize(encrypt_path):
            print(f'Deleting {input_path}')
            os.remove(input_path)
            print(f'Renaming {encrypt_path} to {input_path}')
            os.rename(encrypt_path, input_path)
except Exception as e:
    print(f'EXCEPTION: {str(e)}')
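One refinement, not in the answer above: between os.remove and os.rename there is a brief window with no file at all. Since Python 3.3, os.replace overwrites the destination in a single call (atomically on POSIX, and on modern Windows), so the delete step can be dropped entirely. A minimal sketch of that pattern (atomic_write and the demo file name are hypothetical names for illustration):

```python
import os
import tempfile

def atomic_write(path, data):
    """Write data to path via a temp file and a single atomic os.replace."""
    # Create the temp file in the destination's directory so the final
    # replace stays on one filesystem (a requirement for atomicity).
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
        os.replace(tmp_path, path)  # atomically replaces any existing file
    except BaseException:
        os.remove(tmp_path)  # clean up the temp file on failure
        raise

atomic_write("demo.bin", b"ciphertext goes here")
```

With this pattern, a crash mid-write leaves the original file untouched; the destination only ever holds either the old contents or the complete new contents.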

A file object is always truthy; its truth value doesn't change at EOF, so you can't use it as the condition for your loop.
The file is at EOF when read() returns an empty bytes object (https://docs.python.org/3/library/io.html#io.BufferedIOBase.read).
import os

with open(path, "r+b") as file:
    print("...ENCRYPTING")
    while True:
        file_data = file.read(16)
        if not file_data:
            break
        ciphertext = aes_enc.update(file_data)
        file.seek(-len(file_data), os.SEEK_CUR)
        file.write(ciphertext)
    print("...Complete...")

Related

Reading a large file uses 100% memory and my whole PC freezes

I have made a simple app that encrypts and decrypts files, but when I load a large file, like a 2 GB one, my program uses 100% of the memory. I use multiprocessing and multithreading.
poolSize = min(cpu_count(), len(fileList))
process_pool = Pool(poolSize)
thread_pool = ThreadPool(len(fileList))
lock = Lock()
worker = partial(encfile, process_pool, lock)
thread_pool.map(worker, fileList)

def encfile(process_pool, lock, file):
    with open(file, 'rb') as original_file:
        original = original_file.read()
    encrypted = process_pool.apply(encryptfn, args=(key, original,))
    with open(file, 'wb') as encrypted_file:
        encrypted_file.write(encrypted)
This is my general idea:
Since memory is a problem, you have to read the files in smaller chunks, say 64K pieces, encrypt each 64K block, and write those out. Of course, the encrypted block will have a length other than 64K, so the problem becomes how to decrypt. Each encrypted block must therefore be prefixed with a fixed-length header that is nothing more than the length of the following encrypted block, encoded as a 4-byte unsigned integer (which should be far more than adequate). The decryption loop first reads the next 4-byte length and then knows from it how many bytes long the encrypted block that follows is.
By the way, there is no need to pass a lock to encfile if you are not using it to, for example, count files processed.
from tempfile import mkstemp
from os import fdopen, replace

BLOCKSIZE = 64 * 1024
ENCRYPTED_HEADER_LENGTH = 4

def encfile(process_pool, lock, file):
    """
    Encrypt file in place.
    """
    fd, path = mkstemp()  # make a temporary file
    with open(file, 'rb') as original_file, \
            fdopen(fd, 'wb') as encrypted_file:
        while True:
            original = original_file.read(BLOCKSIZE)
            if not original:
                break
            encrypted = process_pool.apply(encryptfn, args=(key, original))
            l = len(encrypted)
            l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'big')
            encrypted_file.write(l_bytes)
            encrypted_file.write(encrypted)
    replace(path, file)

def decfile(file):
    """
    Decrypt file in place.
    """
    fd, path = mkstemp()  # make a temporary file
    with open(file, 'rb') as encrypted_file, \
            fdopen(fd, 'wb') as original_file:
        while True:
            l_bytes = encrypted_file.read(ENCRYPTED_HEADER_LENGTH)
            if not l_bytes:
                break
            l = int.from_bytes(l_bytes, 'big')
            encrypted = encrypted_file.read(l)
            decrypted = decryptfn(key, encrypted)
            original_file.write(decrypted)
    replace(path, file)
Explanation
The larger the block size the more memory is required (your original program read the entire file; this program will only read 64K at a time). But I am assuming that too small a block size results in too many calls to the encryption, which is done by multiprocessing and that would require more CPU overhead -- so it's a tradeoff. 64K was arbitrary. Increase by a lot if you have the memory. You can even try 1024 * 1024 (1M).
I attempted to explain this before, but let me elaborate:
Let's say that when you encrypt a particular 64K block, its encrypted size ends up being 67,986 bytes (a different 64K block will in general encrypt to a different length unless its unencrypted contents happen to be the same). If I just wrote out the data with no other information, then to decrypt the file I would need some way of knowing that exactly 67,986 bytes must be read back and passed to the decrypt method (with the correct key, of course), because you have to decrypt the precise result of what was encrypted, no fewer and no more bytes. In other words, you can't just read back the encrypted file in arbitrary chunks and pass those chunks to the decrypt method. So the only way to know how big each encrypted chunk is, is to prefix each chunk with a header giving the length of the chunk that follows.
l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'big') takes the integer length stored in variable l and encodes it as a byte array of size ENCRYPTED_HEADER_LENGTH in "big endian" order, meaning that the bytes are arranged from the high-order byte to the low-order byte:
>>> ENCRYPTED_HEADER_LENGTH = 4
>>> l = 67986
>>> l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'big')
>>> l_bytes
b'\x00\x01\t\x92'
>>> l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'little')
>>> l_bytes
b'\x92\t\x01\x00'
>>>
\t is the tab character, with a value of \x09, so we would be writing out 00 01 09 92, which is the 4-byte big-endian hexadecimal encoding of 67986.

python encrypt big file

This script is an XOR encryption function. It works for small files, but when I try to encrypt a big file (about 5 GB) I get the error
"OverflowError: size does not fit in an int"
and opening the file is too slow.
Can anyone help me optimize my script? Thank you.
from Crypto.Cipher import XOR
import base64
import os

def encrypt():
    enpath = "D:\\Software"
    key = 'vinson'
    for files in os.listdir(enpath):
        os.chdir(enpath)
        with open(files, 'rb') as r:
            print("open success", files)
            data = r.read()
            print("loading success", files)
        cipher = XOR.new(key)
        encoding = base64.b64encode(cipher.encrypt(data))
        with open(files, 'wb+') as n:
            n.write(encoding)
To expand upon my comment: you don't want to read the file into memory all at once, but process it in smaller blocks.
With any production-grade cipher (which XOR is definitely not) you would need to also deal with padding the output file if the source data is not a multiple of the cipher's block size. This script does not deal with that, hence the assertion about the block size.
Also, we're no longer irreversibly (well, aside from the fact that the XOR cipher is actually directly reversible) overwriting files with their encrypted versions. (Should you want to do that, it'd be better to just add code to remove the original, then rename the encrypted file into its place. That way you won't end up with a half-written, half-encrypted file.)
Also, I removed the useless Base64 encoding.
But – don't use this code for anything serious. Please don't. Friends don't let friends roll their own crypto.
from Crypto.Cipher import XOR
import os

def encrypt_file(cipher, source_file, dest_file):
    # this toy script is unable to deal with padding issues,
    # so we must have a cipher that doesn't require it:
    assert cipher.block_size == 1
    while True:
        src_data = source_file.read(1048576)  # 1 megabyte at a time
        if not src_data:  # ran out of data?
            break
        encrypted_data = cipher.encrypt(src_data)
        dest_file.write(encrypted_data)

def insecurely_encrypt_directory(enpath, key):
    for filename in os.listdir(enpath):
        file_path = os.path.join(enpath, filename)
        dest_path = file_path + ".encrypted"
        with open(file_path, "rb") as source_file, open(dest_path, "wb") as dest_file:
            cipher = XOR.new(key)
            encrypt_file(cipher, source_file, dest_file)
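The padding concern raised above, for real block ciphers such as AES in CBC mode, is conventionally handled with PKCS#7: append N bytes, each with value N, so the total length becomes a multiple of the block size. As a minimal sketch (pure Python, no cipher library involved; the function names are my own):

```python
def pkcs7_pad(data: bytes, block_size: int = 16) -> bytes:
    """Append N bytes of value N so len(result) is a multiple of block_size."""
    n = block_size - (len(data) % block_size)  # always in 1..block_size
    return data + bytes([n]) * n

def pkcs7_unpad(data: bytes) -> bytes:
    """Strip the padding written by pkcs7_pad, validating it first."""
    n = data[-1]
    if not 1 <= n <= len(data) or data[-n:] != bytes([n]) * n:
        raise ValueError("invalid PKCS#7 padding")
    return data[:-n]

padded = pkcs7_pad(b"hello")  # 5 bytes -> one full 16-byte block
assert len(padded) == 16
assert pkcs7_unpad(padded) == b"hello"
```

Note that an input already a multiple of the block size gains a whole extra block of padding; that is deliberate, since otherwise the unpadder couldn't tell padding from data.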

How to encrypt multiple files in Python 2

I've been creating a data-protection program which encrypts all files on a computer using SHA-256. So far, the program can encrypt one specified file (hard-coded into the program) at a time, appending a .enc extension. The only problem is that the program creates a new file after encryption instead of replacing the original. So if I encrypt mypass.txt, I end up with both mypass.txt and mypass.enc, but I need it to turn mypass.txt into mypass.enc. Additionally, if anyone has any idea how to encrypt all files as opposed to just the one that is hard-coded, I would be extremely thankful.
Thanks so much to anyone who has any input, please let me know if you need any additional information.
import os, random, struct
from Crypto.Cipher import AES

def encrypt_file(key, in_filename, out_filename=None, chunksize=64*1024):
    if not out_filename:
        out_filename = in_filename + '.enc'
    iv = ''.join(chr(random.randint(0, 0xFF)) for i in range(16))
    encryptor = AES.new(key, AES.MODE_CBC, iv)
    filesize = os.path.getsize(in_filename)
    with open(in_filename, 'rb') as infile:
        with open(out_filename, 'wb') as outfile:
            outfile.write(struct.pack('<Q', filesize))
            outfile.write(iv)
            while True:
                chunk = infile.read(chunksize)
                if len(chunk) == 0:
                    break
                elif len(chunk) % 16 != 0:
                    chunk += ' ' * (16 - len(chunk) % 16)
                outfile.write(encryptor.encrypt(chunk))
I'm assuming that you want to remove the contents of the original file as thoroughly as possible.
After creating the encrypted file, you could overwrite the original file with zero bytes, then delete it.
Note: this applies to HDDs. SSDs can and will use a different memory block when overwriting a file, for the purpose of wear levelling, so overwriting with zero bytes is not useful on an SSD. For SSDs you should make sure that TRIM is enabled. (How that is done depends on the OS and filesystem used.) The thing is that only the SSD's controller determines when it will re-use a block of memory, obliterating the old contents. So on an SSD you cannot really be sure that the file contents are gone.
For the reasons mentioned above, I think it is a better idea to use an encrypted filesystem for confidential data, rather than encrypting individual files. That way everything written to the physical device is encrypted.
As for encrypting multiple files, you have several options.
Give the names of the files to be encrypted on the command line. They can be retrieved in your script as sys.argv[1:].
Use os.walk to recursively retrieve the paths of all files under the current working directory and encrypt them.
A combination of the two: if a path in sys.argv[1:] is a file (test with os.path.isfile), encrypt it; if it is a directory (test with os.path.isdir), use os.walk to find all files in that directory and encrypt them.
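The combined approach from the last option can be sketched like this (collect_files is a hypothetical helper name; the print call stands in for the encrypt_file function defined in the question):

```python
import os
import sys

def collect_files(args):
    """Expand a mix of file and directory arguments into plain file paths."""
    for arg in args:
        if os.path.isfile(arg):
            yield arg
        elif os.path.isdir(arg):
            # os.walk descends into every subdirectory of arg
            for dirpath, _dirnames, filenames in os.walk(arg):
                for name in filenames:
                    yield os.path.join(dirpath, name)

for path in collect_files(sys.argv[1:]):
    print(path)  # call encrypt_file(key, path) here instead
```

Anything that is neither a file nor a directory (e.g. a typo on the command line) is silently skipped; you may prefer to report those instead.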

How to move pointer to specific bytes and read using Pickle Library in Python?

Hello everyone!
I need a little help working with binary files (.dat) in Python.
I am using the pickle library to write to the file, which succeeds, but when it comes to reading from the file, my program does not work.
I need help with:
calculating the size of a specific file, in bytes;
moving the pointer to a specific byte in the file (using .seek would be preferable);
reading a specific object from the file (using pickle.load would be preferable);
looping over the file to print all of its contents (I get EOFError using while True:).
Any help would be appreciated.
This is my testing code so far, and it has a lot of issues.
import pickle

with open("BinaryFile.dat", mode="ab") as MyFile:
    pickle.dump("New", MyFile)

with open("BinaryFile.dat", mode="rb") as MyReadFile:
    MyReadFile.seek(3)
    NewLine = pickle.load(MyReadFile)
    print(NewLine)

input("-> ")
pickle will do everything for you, just check its API
import pickle

with open('BinaryFile.dat', mode='ab') as MyFile:
    pickle.dump('New', MyFile)
    pickle.dump([1, 2], MyFile)
    pickle.dump(pickle.dump, MyFile)
    # etc.

with open('BinaryFile.dat', mode='rb') as MyReadFile:
    try:
        while True:
            print(pickle.load(MyReadFile))
    except EOFError:
        pass
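For the file-size and seeking parts of the question, not covered above: os.path.getsize gives the size in bytes, and tell()/seek() let you record where each pickle starts so you can jump straight back to it later. A sketch (the file name matches the question; note pickles are variable-length, so you can only seek to offsets you recorded, not to arbitrary bytes):

```python
import os
import pickle

path = "BinaryFile.dat"

# Record the byte offset where each pickled object begins.
offsets = []
with open(path, "wb") as f:
    for obj in ("New", [1, 2], {"k": 3}):
        offsets.append(f.tell())  # current position = start of this pickle
        pickle.dump(obj, f)

print(os.path.getsize(path), "bytes")  # total file size in bytes

with open(path, "rb") as f:
    f.seek(offsets[1])        # jump straight to the second object
    print(pickle.load(f))     # -> [1, 2]
```

Seeking to an offset that is not the start of a pickle (like seek(3) in the question) lands mid-stream and makes pickle.load fail, which is why recording the offsets at dump time matters.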

Why is python adding "" to my filenames when decrypting AES?

I'm not running into any error, but the output of my program is doing something strange. I've also noticed this thread: Encrypt file with AES-256 and Decrypt file to its original format. This is my own approach to the issue, so I hope this isn't considered a duplicate. I'll post my code below and explain how it functions (not including the encryption code).
For Encryption
path = 'files/*'
files = glob.glob(path)
with open('extensions.txt', 'w') as extension:
    for listing in files:
        endfile = os.path.splitext(listing)[1]
        extension.write(endfile + "\n")
extension.close()
for in_filename in files:
    out_filename1 = os.path.splitext(in_filename)[0]
    out_filename = out_filename1 + '.pycrypt'
    with open(in_filename, 'rb') as in_file, open(out_filename, 'wb') as out_file:
        encrypt(in_file, out_file, password)
    in_file.close()
    out_file.close()
    os.remove(in_filename)
print 'Files Encrypted'
For Decryption
password = raw_input('Password-> ')
path = 'files/*'
files = glob.glob(path)
for in_filename in files:
    f = open('extensions.txt')
    lines = f.readlines()
    counter += 1
    out_filename1 = os.path.splitext(in_filename)[0]
    out_filename = out_filename1 + lines[counter]
    with open(in_filename, 'rb') as in_file, open(out_filename, 'wb') as out_file:
        decrypt(in_file, out_file, password)
    in_file.close()
    out_file.close()
    os.remove(in_filename)
print 'Files Decrypted'
The code takes all the files in a folder and encrypts them using AES, then changes all the file extensions to .pycrypt, saving the old extension(s) into a file called "extensions.txt". After decryption it gives the files their extensions back by reading the text file line by line.
Here's the issue: after decryption, every file goes from this:
15.png, sam.csv
To this
15.png, sam.csv
I've also noticed that if I re-encrypt the files with the  symbol, the "extensions.txt" goes from this:
15.png
sam.csv
bill.jpeg
To this (notice the spaces):
15.png
sam.csv
bill.jpeg
Any ideas what is causing this?
Let's read the documentation (emphasis mine):
file.readline([size])
Read one entire line from the file. A trailing newline character is kept in the string (but may be absent when a file ends with an incomplete line). [6] If the size argument is present and non-negative, it is a maximum byte count (including the trailing newline) and an incomplete line may be returned. When size is not 0, an empty string is returned only when EOF is encountered immediately.
file.readlines([sizehint])
Read until EOF using readline() and return a list containing the lines thus read. If the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read. Objects implementing a file-like interface may choose to ignore sizehint if it cannot be implemented, or cannot be implemented efficiently.
This means that lines[counter] doesn't contain only the file extension, but also the newline character after it. You can remove all whitespace at the beginning and end with lines[counter].strip().
A better way to do this is to encrypt a file "a.jpg" as "a.jpg.enc", so you don't need to store the extension in a separate file.
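The suffix approach from the last sentence can be sketched as follows: the original name comes back by stripping the ".enc" suffix, so no separate extensions.txt (and no newline pitfall) is needed. The helper names here are hypothetical:

```python
import os

def encrypted_name(path):
    """files/a.jpg -> files/a.jpg.enc"""
    return path + ".enc"

def original_name(path):
    """files/a.jpg.enc -> files/a.jpg; refuses anything without the suffix."""
    root, ext = os.path.splitext(path)
    if ext != ".enc":
        raise ValueError("not an encrypted file: " + path)
    return root

assert encrypted_name("files/15.png") == "files/15.png.enc"
assert original_name("files/15.png.enc") == "files/15.png"
```

The ValueError guard also gives you a cheap way to skip files that were never encrypted when sweeping a directory for decryption.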
