I've been creating a data-protection program which encrypts all files on a computer using SHA-256. So far, the program is capable of encrypting one specified file (that has been hard-coded into the program) at a time and appending a .enc extension. The only problem here is that the program creates a new file after the encryption instead of saving over the original. So if I encrypt mypass.txt, I will now have mypass.txt as well as mypass.enc, but I need it to convert mypass.txt into mypass.enc. Additionally, if anyone has any idea as to how to encrypt all files as opposed to just one that is hard-coded I would be extremely thankful.
Thanks so much to anyone who has any input, please let me know if you need any additional information.
import os, struct
from Crypto.Cipher import AES

def encrypt_file(key, in_filename, out_filename=None, chunksize=64*1024):
    if not out_filename:
        out_filename = in_filename + '.enc'

    # 16-byte random IV; os.urandom is preferable to random.randint for this,
    # and it returns bytes, which is what AES.new and outfile.write expect
    iv = os.urandom(16)
    encryptor = AES.new(key, AES.MODE_CBC, iv)
    filesize = os.path.getsize(in_filename)

    with open(in_filename, 'rb') as infile:
        with open(out_filename, 'wb') as outfile:
            # store the original size so the padding can be stripped on decryption
            outfile.write(struct.pack('<Q', filesize))
            outfile.write(iv)

            while True:
                chunk = infile.read(chunksize)
                if len(chunk) == 0:
                    break
                elif len(chunk) % 16 != 0:
                    # pad the last chunk up to the 16-byte AES block size
                    chunk += b' ' * (16 - len(chunk) % 16)
                outfile.write(encryptor.encrypt(chunk))
I'm assuming that you want to remove the contents of the original file as thoroughly as possible.
After creating the encrypted file, you could overwrite the original file with 0 bytes, and delete it.
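For example, a rough sketch of that overwrite-then-delete idea (the function name and chunk size are mine; the SSD caveats in the note below apply):

import os

def overwrite_and_remove(path, chunksize=64 * 1024):
    """Overwrite a file with zero bytes, flush it to disk, then delete it."""
    size = os.path.getsize(path)
    with open(path, 'r+b') as f:
        remaining = size
        while remaining > 0:
            block = min(chunksize, remaining)
            f.write(b'\x00' * block)
            remaining -= block
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the zeros to the physical disk
    os.remove(path)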
Note: this applies to HDDs. An SSD can and will use a different memory block when overwriting a file, for the purpose of wear levelling, so overwriting with 0 bytes is not useful on an SSD. For SSDs you should make sure that TRIM is enabled (how that is done depends on the OS and filesystem used). The point is that only the SSD's controller determines when it will reuse a block of memory, obliterating the old contents, so on an SSD you cannot really be sure that the file contents are gone.
For the reasons mentioned above, I think it is a better idea to use an encrypted filesystem for confidential data, rather than encrypting individual files. That way everything that is written to the physical device is encrypted.
As for encrypting multiple files, you have several options.
Give the names of the files to be encrypted on the command line. These can be retrieved in your script as sys.argv[1:].
Use os.walk to recursively retrieve the paths of all files under the current working directory and encrypt them.
A combination of the two. If a path in sys.argv[1:] is a file (test with os.path.isfile), encrypt it. If it is a directory (test with os.path.isdir), use os.walk to find all files in that directory and encrypt them. A minimal sketch of this combined approach follows below.
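For example (assuming the encrypt_file function and key from the code above; the helper name is mine):

import os
import sys

def iter_target_files(args):
    # yield every file named on the command line, descending into directories
    for path in args:
        if os.path.isfile(path):
            yield path
        elif os.path.isdir(path):
            for dirpath, _dirnames, filenames in os.walk(path):
                for name in filenames:
                    yield os.path.join(dirpath, name)

for target in iter_target_files(sys.argv[1:]):
    encrypt_file(key, target)  # encrypt_file and key as defined above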
I am currently doing a project where I turn my Pi (Model 4, 2 GB) into a sort of NAS archive. I decided to learn a bit of Python along the way and wrote a small console app to "manage" my database. One function I added hashes the files in the database so it knows when files are corrupted.
To achieve this I hash a file like this:
from hashlib import sha256

with open(file, "rb") as f:
    rbytes = f.read()
    readable_hash = sha256(rbytes).hexdigest()
Now when I run this on smaller files it works just fine but on large files like videos it spits out a MemoryError - I presume this is because it doesn't have enough RAM to hold the file?
I've seen that you can break the read up into chunks but does this also work for hashing? If so, how?
Also I'm not a programmer. I want to learn in the process, so the simpler the solution the better - I want to actually understand the code I use. :) Doesn't need to be a super fast algorithm that squeezes out every millisecond either, as long as it gets the job done.
Thanks for any help in advance!
One solution is to hash the file a chunk at a time, mixing each new chunk with the hash computed so far; the final hash still depends on the whole file, it just takes a few extra steps.
import hashlib

def hash_file(filename, chunk_size):
    hashed = ""  # start with an empty string
    with open(filename, 'rb') as f:  # read from the file
        while True:  # keep reading chunk_size bytes until the loop is broken
            chunk = f.read(chunk_size)  # read one chunk
            if chunk:  # as long as "chunk" is not empty
                # hash the previous hash together with the newly read chunk
                hashed = str(hashlib.md5(str(chunk).encode() + hashed.encode()).digest())
            else:
                break  # stop the loop at end of file
    return hashed

print(hash_file('file.txt', 1000))
By hashing the contents over and over again, each new string is derived from the previous string/hash, so the value always stays much smaller than the whole file (because MD5 hashes always have the same size) while still depending on all of it.
PS: the chunk_size argument can be anything, but more bytes means more memory while fewer bytes means longer compute time; try what fits your needs. 1000–9000 seems to be a good spot.
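For comparison, hashlib objects also support incremental hashing directly: you can feed chunks into a single hash object with update(), which avoids keeping either the file or a growing string in memory. A minimal sketch using SHA-256 to match the question (the function name and chunk size are mine):

import hashlib

def sha256_file(path, chunk_size=64 * 1024):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # end of file
                break
            h.update(chunk)  # feed each chunk into the same hash object
    return h.hexdigest()

print(sha256_file("file.txt"))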
I have made a simple app that encrypts and decrypts files, but when I load a large file, like a 2 GB one, my program uses 100% of the memory. I use multiprocessing and multithreading.
from functools import partial
from multiprocessing import Pool, Lock, cpu_count
from multiprocessing.pool import ThreadPool

def encfile(process_pool, lock, file):
    # key, encryptfn and fileList are defined elsewhere in the program
    with open(file, 'rb') as original_file:
        original = original_file.read()
    encrypted = process_pool.apply(encryptfn, args=(key, original,))
    with open(file, 'wb') as encrypted_file:
        encrypted_file.write(encrypted)

poolSize = min(cpu_count(), len(fileList))
process_pool = Pool(poolSize)
thread_pool = ThreadPool(len(fileList))
lock = Lock()
worker = partial(encfile, process_pool, lock)
thread_pool.map(worker, fileList)
This is my general idea:
Since memory is a problem, you have to read the files in smaller chunks, say 64K pieces, encrypt each 64K block, and write those out. Of course, each encrypted block will have a length other than 64K, so the problem becomes how to decrypt. Each encrypted block must therefore be prefixed with a fixed-length header that is nothing more than the length of the following encrypted block, encoded as a 4-byte unsigned integer (which should be way more than adequate). The decryption loop first reads the next 4-byte length and then knows from that how many bytes long the encrypted block that follows is.
By the way, there is no need to pass a lock to encfile if you are not using it to, for example, count the files processed.
from tempfile import mkstemp
from os import fdopen, replace

BLOCKSIZE = 64 * 1024
ENCRYPTED_HEADER_LENGTH = 4

def encfile(process_pool, lock, file):
    """
    Encrypt file in place.
    """
    # encryptfn, decryptfn and key come from the rest of your program
    fd, path = mkstemp()  # make a temporary file
    with open(file, 'rb') as original_file, \
         fdopen(fd, 'wb') as encrypted_file:
        while True:
            original = original_file.read(BLOCKSIZE)
            if not original:
                break
            encrypted = process_pool.apply(encryptfn, args=(key, original))
            l = len(encrypted)
            # prefix each encrypted block with its length as a 4-byte header
            l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'big')
            encrypted_file.write(l_bytes)
            encrypted_file.write(encrypted)
    replace(path, file)

def decfile(file):
    """
    Decrypt file in place.
    """
    fd, path = mkstemp()  # make a temporary file
    with open(file, 'rb') as encrypted_file, \
         fdopen(fd, 'wb') as original_file:
        while True:
            l_bytes = encrypted_file.read(ENCRYPTED_HEADER_LENGTH)
            if not l_bytes:
                break
            l = int.from_bytes(l_bytes, 'big')
            encrypted = encrypted_file.read(l)
            decrypted = decryptfn(key, encrypted)
            original_file.write(decrypted)
    replace(path, file)
Explanation
The larger the block size, the more memory is required (your original program read the entire file; this program will only read 64K at a time). But I am assuming that too small a block size results in too many calls to the encryption function, which runs via multiprocessing and would add more CPU overhead -- so it's a tradeoff. 64K was arbitrary. Increase it by a lot if you have the memory. You can even try 1024 * 1024 (1M).
I attempted to explain this before, but let me elaborate:
So let's say that when you encrypt a 64K block, the encrypted size for that particular block ends up being 67,986 bytes (a different 64K block will in general encrypt to a different length unless its unencrypted contents happened to be the same). If I just wrote out the data with no other information, I would need some way to know that, to decrypt the file, it is first necessary to read back exactly 67,986 bytes of data and pass them to the decrypt method (with the correct key, of course), because you have to decrypt the precise result of what was encrypted, no fewer and no more bytes. In other words, you can't just read back the encrypted file in arbitrary chunks and pass those chunks to the decrypt method. So the only way to know how big each encrypted chunk is, is to prefix each chunk with a header giving the length of the chunk that follows.
l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'big') takes the integer length stored in the variable l and encodes it as a byte array of size ENCRYPTED_HEADER_LENGTH in "big endian" order, meaning that the bytes are arranged from high-order bytes to low-order bytes:
>>> ENCRYPTED_HEADER_LENGTH = 4
>>> l = 67986
>>> l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'big')
>>> l_bytes
b'\x00\x01\t\x92'
>>> l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'little')
>>> l_bytes
b'\x92\t\x01\x00'
>>>
\t is the tab character, with a value of \x09, so we would be writing out the bytes 00 01 09 92, which is the 4-byte big-endian hexadecimal encoding of 67986.
I have a .gz file and I need to get the name of the files inside it using Python.
This question is the same as this one
The only difference is that my file is .gz, not .tar.gz, so the tarfile library did not help me here.
I am using the requests library to request a URL. The response is a compressed file.
Here is the code I am using to download the file
import shutil
import requests

response = requests.get(line.rstrip(), stream=True)
if response.status_code == 200:
    with open(str(base_output_dir) + "/" + str(current_dir) + "/" + str(count) + ".gz", 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    del response
This code downloads the file with the name 1.gz, for example. Now if I open the file with an archive manager, it will contain something like my_latest_data.json.
I need to extract the file, and the output should be my_latest_data.json.
Here is the code I am using to extract the file
inF = gzip.open(f, 'rb')
outfilename = f.split(".")[0]
outF = open(outfilename, 'wb')
outF.write(inF.read())
inF.close()
outF.close()
The outfilename variable is a string I provide in the script, but I need the real file name (my_latest_data.json).
You can't, because Gzip is not an archive format.
That's a bit of a crap explanation on its own, so let me break this down a bit more than I did in the comment...
It's just compression
Being "just a compression system" means that Gzip operates on input bytes (usually from a file) and outputs compressed bytes. You cannot know whether the bytes inside represent multiple files or just a single file -- it is just a stream of bytes that has been compressed. That is why you can accept gzipped data over a network, for example. It's bytes_in -> bytes_out.
What's a manifest?
A manifest is a header within an archive that acts as a table of contents for the archive. Note that now I am using the term "archive" and not "compressed stream of bytes". An archive implies that it is a collection of files or segments that are referred to by a manifest -- a compressed stream of bytes is just a stream of bytes.
What's inside a Gzip, anyway?
A somewhat simplified description of a .gz file's contents is:
A header with a special number to indicate it's a gzip, a version and a timestamp (10 bytes)
Optional headers; usually including the original filename (if the compression target was a file)
The body -- some compressed payload
A CRC-32 checksum and the uncompressed size at the end (8 bytes total)
That's it. No manifest.
Archive formats, on the other hand, will have a manifest inside. That's where the tar library would come in. Tar is just a way to shove a bunch of files together into a single stream, and it records a header for each member that lets you know the name of the original file and what size it was before being concatenated into the archive. Hence, .tar.gz being so common.
There are utilities that allow you to decompress parts of a gzipped file at a time, or decompress it only in memory, to then let you examine a manifest or whatever may be inside. But the details of any manifest are specific to the archive format contained inside.
Note that this is different from a zip archive. Zip is an archive format, and as such contains a manifest. Gzip is a compression library, like bzip2 and friends.
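For contrast, a zip archive's manifest can be read directly with the standard library (the archive name here is just a placeholder):

import zipfile

# A zip archive carries a manifest, so the member names are available directly.
with zipfile.ZipFile('example.zip') as zf:
    print(zf.namelist())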
As noted in the other answer, your question can only make sense if I take out the plural: "I have a .gz file and I need to get the name of the file inside it using Python."
A gzip header may or may not have a file name in it. The gzip utility will normally ignore the name in the header, and decompress to a file with the same name as the .gz file, but with the .gz stripped. E.g. your 1.gz would decompress to a file named 1, even if the header has the file name my_latest_data.json in it. The -N option of gzip will use the file name in the header (as well as the time stamp in the header), if there is one. So gzip -dN 1.gz would create the file my_latest_data.json, instead of 1.
You can find the file name in the header in Python by processing the header manually; the details are in the gzip specification. The steps are as follows (a short sketch follows the list):
Verify that the first three bytes are 1f 8b 08.
Save the fourth byte. Call it flags. If flags & 8 is zero, then give up -- there is no file name in the header.
Skip the next six bytes.
If flags & 2 is not zero, there is also a two-byte header CRC, but it appears after the file name, so it can be ignored for this purpose.
If flags & 4 is not zero, then read the next two bytes. Considering them to be in little endian order, make an integer out of those two bytes, calling it xlen. Then skip xlen bytes.
We already know that flags & 8 is not zero, so you are now at the file name. Read bytes until you get to a zero byte. The bytes up to, but not including, the zero byte are the file name.
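Putting those steps together, a small sketch (the function name is mine; it returns None when no file name is stored):

def gzip_stored_name(path):
    with open(path, 'rb') as f:
        if f.read(3) != b'\x1f\x8b\x08':  # magic number plus deflate method
            return None
        flags = f.read(1)[0]  # the flags byte
        if flags & 8 == 0:  # FNAME bit not set: no file name in the header
            return None
        f.read(6)  # skip MTIME, XFL and OS
        if flags & 4:  # FEXTRA: skip the extra field
            xlen = int.from_bytes(f.read(2), 'little')
            f.read(xlen)
        name = bytearray()
        while True:
            b = f.read(1)
            if not b or b == b'\x00':  # the name is zero-terminated
                break
            name += b
        return name.decode('latin-1')

print(gzip_stored_name('1.gz'))  # e.g. my_latest_data.json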
Note: This answer is obsolete as of Python 3.
Using the tips from Mark Adler's reply and a bit of inspection of the gzip module, I've set up this function that extracts the internal filename from gzip files. I noticed that GzipFile objects have a private method called _read_gzip_header() that almost gets the filename, so I based this on that:
import gzip

def get_gzip_filename(filepath):
    f = gzip.open(filepath)
    f._read_gzip_header()
    f.fileobj.seek(0)
    f.fileobj.read(3)
    flag = ord(f.fileobj.read(1))
    mtime = gzip.read32(f.fileobj)
    f.fileobj.read(2)
    if flag & gzip.FEXTRA:
        # Read & discard the extra field, if present
        xlen = ord(f.fileobj.read(1))
        xlen = xlen + 256 * ord(f.fileobj.read(1))
        f.fileobj.read(xlen)
    filename = ''
    if flag & gzip.FNAME:
        while True:
            s = f.fileobj.read(1)
            if not s or s == '\000':
                break
            else:
                filename += s
    return filename or None
The Python 3 gzip library discards this information, but you could adapt the code from around the link to do something else with it.
As noted in other answers on this page, this information is optional anyway. But it's not impossible to retrieve if you need to check whether it's there.
import struct

def gzinfo(filename):
    # Copy+paste from gzip.py line 16
    FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT = 1, 2, 4, 8, 16

    with open(filename, 'rb') as fp:
        # Basically copy+paste from GzipFile module line 429f
        magic = fp.read(2)
        if magic == b'':
            return False
        if magic != b'\037\213':
            raise ValueError('Not a gzipped file (%r)' % magic)

        method, flag, _last_mtime = struct.unpack("<BBIxx", fp.read(8))
        if method != 8:
            raise ValueError('Unknown compression method')

        if flag & FEXTRA:
            # Read & discard the extra field, if present
            extra_len, = struct.unpack("<H", fp.read(2))
            fp.read(extra_len)
        if flag & FNAME:
            fname = []
            while True:
                s = fp.read(1)
                if not s or s == b'\000':
                    break
                fname.append(s.decode('latin-1'))
            return ''.join(fname)

def main():
    from sys import argv
    for filename in argv[1:]:
        print(filename, gzinfo(filename))

if __name__ == '__main__':
    main()
This replaces the exceptions in the original code with a vague ValueError exception (you might want to fix that if you intend to use this more broadly and turn it into a proper module you can import). It also uses the generic read() function instead of the specific _read_exact() method, which goes through some trouble to ensure that it gets exactly the number of bytes it requested (this too could be lifted over if you wanted to).
I am trying to encrypt a file in place using the cryptography module, so I don't have to buffer the ciphertext of the whole file, which can be memory intensive, and then replace the original file with the encrypted one. So my solution is to encrypt a chunk of plaintext and then replace it with its ciphertext, 16 bytes at a time (AES-CTR mode). The problem is that the loop seems to run forever.
So how do I fix this?
What other methods do you suggest?
What are the side effects of using such a method as the one below?
pointer = 0
with open(path, "r+b") as file:
    print("...ENCRYPTING")
    while file:
        file_data = file.read(16)
        pointer += 16
        ciphertext = aes_enc.update(file_data)
        file.seek(pointer - 16)
        file.write(ciphertext)
    print("...Complete...")
So how do I fix this?
As Cyril Jouve already mentions, check for if not file_data
What other methods do you suggest?
What are the side effects of using such a method as the one below?
Reading in blocks of 16 bytes is relatively slow. I guess you have enough memory to read larger blocks like 4096, 8192 ...
Unless you have very large files and limited disk space, I think there is no benefit in reading and writing to the same file. In case of an error, if the OS has already written data to disk, you will have lost the original data and will be left with a partially encrypted file without knowing which part is encrypted.
It's easier and safer to create a new encrypted file and then delete and rename only if there were no errors.
Encrypt to a new file, catch exceptions, check the existence and size of the encrypted file, and delete the source and rename the encrypted file only if all is OK.
import os

path = r'D:\test.dat'

input_path = path
encrypt_path = path + '_encrypt'

try:
    with open(input_path, "rb") as input_file:
        with open(encrypt_path, "wb") as encrypt_file:
            print("...ENCRYPTING")
            while True:
                file_data = input_file.read(4096)
                if not file_data:
                    break
                ciphertext = aes_enc.update(file_data)
                encrypt_file.write(ciphertext)
            print("...Complete...")

    if os.path.exists(encrypt_path):
        if os.path.getsize(input_path) == os.path.getsize(encrypt_path):
            print(f'Deleting {input_path}')
            os.remove(input_path)
            print(f'Renaming {encrypt_path} to {input_path}')
            os.rename(encrypt_path, input_path)
except Exception as e:
    print(f'EXCEPTION: {str(e)}')
A file object's truth value doesn't reflect whether there is data left to read, so you can't use it as the condition for your loop; it will never become false.
The file is at EOF when read() returns an empty bytes object (https://docs.python.org/3/library/io.html#io.BufferedIOBase.read)
import os

with open(path, "r+b") as file:
    print("...ENCRYPTING")
    while True:
        file_data = file.read(16)
        if not file_data:
            break
        ciphertext = aes_enc.update(file_data)
        # step back over the plaintext just read, then overwrite it with ciphertext
        file.seek(-len(file_data), os.SEEK_CUR)
        file.write(ciphertext)
    print("...Complete...")
This script is an XOR encrypt function. It works fine for small files, but when I tried to encrypt a big file (about 5 GB) I got this error:
"OverflowError: size does not fit in an int"
and opening the file is also too slow.
Can anyone help me optimize my script? Thank you.
from Crypto.Cipher import XOR
import base64
import os

def encrypt():
    enpath = "D:\\Software"
    key = 'vinson'
    for files in os.listdir(enpath):
        os.chdir(enpath)
        with open(files, 'rb') as r:
            print("open success", files)
            data = r.read()
            print("loading success", files)
        cipher = XOR.new(key)
        encoding = base64.b64encode(cipher.encrypt(data))
        with open(files, 'wb+') as n:
            n.write(encoding)
To expand upon my comment: you don't want to read the file into memory all at once, but process it in smaller blocks.
With any production-grade cipher (which XOR is definitely not) you would also need to deal with padding the output if the source data is not a multiple of the cipher's block size. This script does not deal with that, hence the assertion about the block size.
Also, we're no longer irreversibly (well, aside from the fact that the XOR cipher is actually directly reversible) overwriting files with their encrypted versions. (Should you want to do that, it'd be better to just add code to remove the original, then rename the encrypted file into its place. That way you won't end up with a half-written, half-encrypted file.)
Also, I removed the useless Base64 encoding.
But – don't use this code for anything serious. Please don't. Friends don't let friends roll their own crypto.
from Crypto.Cipher import XOR
import os

def encrypt_file(cipher, source_file, dest_file):
    # this toy script is unable to deal with padding issues,
    # so we must have a cipher that doesn't require it:
    assert cipher.block_size == 1
    while True:
        src_data = source_file.read(1048576)  # 1 megabyte at a time
        if not src_data:  # ran out of data?
            break
        encrypted_data = cipher.encrypt(src_data)
        dest_file.write(encrypted_data)

def insecurely_encrypt_directory(enpath, key):
    for filename in os.listdir(enpath):
        file_path = os.path.join(enpath, filename)
        dest_path = file_path + ".encrypted"
        with open(file_path, "rb") as source_file, open(dest_path, "wb") as dest_file:
            cipher = XOR.new(key)
            encrypt_file(cipher, source_file, dest_file)
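A hypothetical invocation, mirroring the directory and key from the question (the key is given as bytes here, which is usually what the cipher layer expects on Python 3):

if __name__ == "__main__":
    # directory and key mirror the question's values; both are placeholders
    insecurely_encrypt_directory("D:\\Software", b"vinson")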