python encrypt big file

This script implements an XOR encryption function. It works fine for small files, but when I try to encrypt a big file (about 5 GB) I get this error:
"OverflowError: size does not fit in an int"
and opening the file is also very slow.
Can anyone help me optimize my script? Thank you.
from Crypto.Cipher import XOR
import base64
import os

def encrypt():
    enpath = "D:\\Software"
    key = 'vinson'
    for files in os.listdir(enpath):
        os.chdir(enpath)
        with open(files, 'rb') as r:
            print ("open success", files)
            data = r.read()
            print ("loading success", files)
        cipher = XOR.new(key)
        encoding = base64.b64encode(cipher.encrypt(data))
        with open(files, 'wb+') as n:
            n.write(encoding)

To expand upon my comment: you don't want to read the file into memory all at once, but process it in smaller blocks.
With any production-grade cipher (which XOR is definitely not) you would need to also deal with padding the output file if the source data is not a multiple of the cipher's block size. This script does not deal with that, hence the assertion about the block size.
Also, we're no longer irreversibly overwriting files with their encrypted versions (well, aside from the fact that the XOR cipher is directly reversible). Should you want to replace the originals, it'd be better to add code that removes the original and then renames the encrypted file into its place; that way you won't end up with a half-written, half-encrypted file. (A sketch of that follows after the code below.)
Also, I removed the useless Base64 encoding.
But – don't use this code for anything serious. Please don't. Friends don't let friends roll their own crypto.
from Crypto.Cipher import XOR
import os

def encrypt_file(cipher, source_file, dest_file):
    # this toy script is unable to deal with padding issues,
    # so we must have a cipher that doesn't require it:
    assert cipher.block_size == 1
    while True:
        src_data = source_file.read(1048576)  # 1 megabyte at a time
        if not src_data:  # ran out of data?
            break
        encrypted_data = cipher.encrypt(src_data)
        dest_file.write(encrypted_data)

def insecurely_encrypt_directory(enpath, key):
    for filename in os.listdir(enpath):
        file_path = os.path.join(enpath, filename)
        dest_path = file_path + ".encrypted"
        with open(file_path, "rb") as source_file, open(dest_path, "wb") as dest_file:
            cipher = XOR.new(key)
            encrypt_file(cipher, source_file, dest_file)
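Should you want the replace-the-original behaviour mentioned above, here's a minimal sketch reusing encrypt_file and the XOR import from above; the guard against our own ".encrypted" output is my assumption, added so re-runs stay safe:

def insecurely_encrypt_directory_in_place(enpath, key):
    for filename in os.listdir(enpath):
        if filename.endswith(".encrypted"):
            continue  # assumed guard: don't re-encrypt our own output on a re-run
        file_path = os.path.join(enpath, filename)
        dest_path = file_path + ".encrypted"
        with open(file_path, "rb") as source_file, open(dest_path, "wb") as dest_file:
            cipher = XOR.new(key)
            encrypt_file(cipher, source_file, dest_file)
        # only after the encrypted copy is fully written do we touch the original
        os.remove(file_path)
        os.rename(dest_path, file_path)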

Related

how many bytes should I read from files stored locally for hash check

def get_chunks(self):
    fd = open(self.path, 'rb')
    while True:
        chunk = fd.read(1)
        if not chunk:
            break
        yield chunk
    fd.close()

def calculate(self):
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    sha512 = hashlib.sha512()
    for chunk in self.get_chunks():
        md5.update(chunk)
        sha256.update(chunk)
        sha512.update(chunk)
    self.md5 = md5.hexdigest()
    self.sha256 = sha256.hexdigest()
    self.sha512 = sha512.hexdigest()
I am using multiple hash algorithms to try to identify files on my computer, and I have run into a problem while building this tool in Python:
"How many bytes should I read at one time if I intend to use them to update multiple hash algorithms, while keeping this piece of code running correctly on as many platforms as possible?"
Any suggestions, please?
I would go for one hash algorithm to hash each complete file, and in order to avoid collision problems, compare the colliding files bytewise. This should put you in a safe spot.
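A minimal sketch of that suggestion, assuming you have two candidate paths to compare; hash_file reads 64 KiB at a time instead of one byte, and filecmp.cmp(..., shallow=False) does the bytewise comparison when the hashes collide:

import filecmp
import hashlib

def hash_file(path, chunk_size=65536):
    h = hashlib.sha256()
    with open(path, 'rb') as fd:
        while True:
            chunk = fd.read(chunk_size)  # 64 KiB at a time instead of 1 byte
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

def same_contents(path_a, path_b):
    # different hashes -> definitely different files
    if hash_file(path_a) != hash_file(path_b):
        return False
    # matching hashes -> confirm with a full bytewise comparison
    return filecmp.cmp(path_a, path_b, shallow=False)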

Best way to replace files in place safely and efficiently?

I am trying to encrypt a file in place using the cryptography module, so I don't have to buffer the ciphertext of the whole file (which can be memory intensive) and then replace the original file with the encrypted one. My solution is to encrypt a chunk of plaintext and then overwrite it with its ciphertext, 16 bytes at a time (AES-CTR mode). The problem seems to be that the loop is infinite.
So how do I fix this?
What other methods do you suggest?
What are the side effects of using a method like the one below?
pointer = 0
with open(path, "r+b") as file:
    print("...ENCRYPTING")
    while file:
        file_data = file.read(16)
        pointer += 16
        ciphertext = aes_enc.update(file_data)
        file.seek(pointer - 16)
        file.write(ciphertext)
    print("...Complete...")
So how do I fix this?
As Cyril Jouve already mentions, check for if not file_data.
What other methods do you suggest?
What are the side effects of using such a method?
Reading in blocks of 16 bytes is relatively slow. I guess you have enough memory to read larger blocks like 4096 or 8192 bytes.
Unless you have very large files and limited disk space, I think there is no benefit in reading and writing in the same file. In case of an error, if the OS has already written data to disk, you will have lost the original data and will be left with an incomplete encrypted file, without knowing which part is encrypted.
It's easier and safer to create a new encrypted file and then delete and rename it if there were no errors:
Encrypt to a new file, catch exceptions, check the existence and size of the encrypted file, and delete the source and rename the encrypted file only if all is OK.
import os

path = r'D:\test.dat'
input_path = path
encrypt_path = path + '_encrypt'

try:
    with open(input_path, "rb") as input_file:
        with open(encrypt_path, "wb") as encrypt_file:
            print("...ENCRYPTING")
            while True:
                file_data = input_file.read(4096)
                if not file_data:
                    break
                ciphertext = aes_enc.update(file_data)
                encrypt_file.write(ciphertext)
            print("...Complete...")
    if os.path.exists(encrypt_path):
        if os.path.getsize(input_path) == os.path.getsize(encrypt_path):
            print(f'Deleting {input_path}')
            os.remove(input_path)
            print(f'Renaming {encrypt_path} to {input_path}')
            os.rename(encrypt_path, input_path)
except Exception as e:
    print(f'EXCEPTION: {str(e)}')
A file object has no useful "truthiness": it is always true and does not become false at EOF, so you can't use it as the condition for your loop.
The file is at EOF when read() returns an empty bytes object (https://docs.python.org/3/library/io.html#io.BufferedIOBase.read):
import os

with open(path, "r+b") as file:
    print("...ENCRYPTING")
    while True:
        file_data = file.read(16)
        if not file_data:
            break
        ciphertext = aes_enc.update(file_data)
        file.seek(-len(file_data), os.SEEK_CUR)
        file.write(ciphertext)
    print("...Complete...")

How to encrypt multiple files in Python 2

I've been creating a data-protection program which encrypts all files on a computer using AES. So far, the program is capable of encrypting one specified file (hard-coded into the program) at a time and appending a .enc extension. The only problem is that the program creates a new file after encryption instead of saving over the original. So if I encrypt mypass.txt, I end up with both mypass.txt and mypass.enc, but I need it to convert mypass.txt into mypass.enc. Additionally, if anyone has any idea how to encrypt all files as opposed to just the one that is hard-coded, I would be extremely thankful.
Thanks so much to anyone who has any input; please let me know if you need any additional information.
import os, random, struct
from Crypto.Cipher import AES

def encrypt_file(key, in_filename, out_filename=None, chunksize=64*1024):
    if not out_filename:
        out_filename = in_filename + '.enc'
    iv = ''.join(chr(random.randint(0, 0xFF)) for i in range(16))
    encryptor = AES.new(key, AES.MODE_CBC, iv)
    filesize = os.path.getsize(in_filename)
    with open(in_filename, 'rb') as infile:
        with open(out_filename, 'wb') as outfile:
            outfile.write(struct.pack('<Q', filesize))
            outfile.write(iv)
            while True:
                chunk = infile.read(chunksize)
                if len(chunk) == 0:
                    break
                elif len(chunk) % 16 != 0:
                    chunk += ' ' * (16 - len(chunk) % 16)
                outfile.write(encryptor.encrypt(chunk))
I'm assuming that you want to remove the contents of the original file as thoroughly as possible.
After creating the encrypted file, you could overwrite the original file with zero bytes and then delete it (a sketch of this follows below).
Note: this is for HDDs. SSD drives can and will use a different memory block when overwriting a file, for the purpose of wear levelling, so overwriting with zero bytes is not useful on an SSD. For SSDs you should make sure that TRIM is enabled (how that is done depends on the OS and filesystem used). The thing is that only the SSD's controller determines when it will re-use a block of memory, obliterating the old contents, so on an SSD you cannot really be sure that the file contents are gone.
For the reasons mentioned above, I think it is a better idea to use an encrypted filesystem for confidential data, rather than encrypting individual files. That way everything written to the physical device is encrypted.
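A minimal sketch of the overwrite-then-delete idea above (HDDs only, per the caveats); the chunked writing and the fsync call are my assumptions about how you'd push the zeros out to disk:

import os

def wipe_and_delete(path, chunk_size=64 * 1024):
    size = os.path.getsize(path)
    with open(path, 'r+b') as f:
        remaining = size
        while remaining > 0:
            block = min(chunk_size, remaining)
            f.write(b'\x00' * block)  # overwrite the contents with zero bytes
            remaining -= block
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the zeros to the physical disk
    os.remove(path)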
As for encrypting multiple files, you have several options:
Give the names of the files to be encrypted on the command line. They can be retrieved in your script as sys.argv[1:].
Use os.walk to recursively retrieve the paths of all files under the current working directory and encrypt them.
A combination of the two: if a path in sys.argv[1:] is a file (test with os.path.isfile), encrypt it; if it is a directory (test with os.path.isdir), use os.walk to find all files in that directory and encrypt them. A sketch of this combined approach follows below.
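A sketch of that combined approach, reusing the question's encrypt_file; the 16-byte key is a hypothetical placeholder, and the .enc check is my assumption so we don't re-encrypt our own output:

import os
import sys

def encrypt_tree(key, top):
    # recursively encrypt every regular file under top
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            if not name.endswith('.enc'):  # skip already-encrypted output
                encrypt_file(key, os.path.join(dirpath, name))

if __name__ == '__main__':
    key = 'sixteen byte key'  # hypothetical placeholder; derive a real key properly
    for path in sys.argv[1:]:
        if os.path.isfile(path):
            encrypt_file(key, path)
        elif os.path.isdir(path):
            encrypt_tree(key, path)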

Converting Images with Python

I'm having problems converting images to and from bytes/strings/etc. I can turn an image into a string, a byte array, or b64encode it, but when I try to decode/revert it back to an image, it never works. I've tried a lot of things: locally converting an image and then reconverting it, saving it under a different name. However, the resulting files never actually show anything (black on Linux, "can't display image" on Windows).
My most basic b64encoding script is as follows:
import base64

def convert(image):
    f = open(image)
    data = f.read()
    f.close()
    string = base64.b64encode(data)
    convertit = base64.b64decode(string)
    t = open("Puppy2.jpg", "w+")
    t.write(convertit)
    t.close()

if __name__ == "__main__":
    convert("Puppy.jpg")
I've been stuck on this for a while. I'm sure it's a simple solution, but being new to Python, it's been a bit difficult trying to sort things out.
If it helps with any insight, the end goal here is to transfer images over a network, possibly MQTT.
Any help is much appreciated. Thanks!
Edit** This is in Python 2.7.
Edit 2** Wow, you guys move fast. What a great intro to the community - thanks a lot for the quick responses and super fast results!
For Python 3, you need to open and write in binary mode:
def convert(image):
    f = open(image, "rb")
    data = f.read()
    f.close()
    string = base64.b64encode(data)
    decoded = base64.b64decode(string)
    t = open("Puppy2.jpg", "wb")
    t.write(decoded)
    t.close()
Using Python 2 on Linux, simply r and w should work fine. On Windows you need to do the same as above.
From the docs:
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
You can also write your code a little more succinctly by using with to open your files which will automatically close them for you:
from base64 import b64encode, b64decode

def convert(image):
    with open(image, "rb") as f, open("Puppy2.jpg", "wb") as t:
        conv = b64decode(b64encode(f.read()))
        t.write(conv)
import base64

def convert(image):
    f = open(image, "rb")  # binary mode so the JPEG data isn't mangled
    data = f.read()
    f.close()
    return data

if __name__ == "__main__":
    data = convert("Puppy.jpg")
    string = base64.b64encode(data)
    decoded = base64.b64decode(string)
    t = open("Puppy2.jpg", "wb")
    t.write(decoded)
    t.close()

Hashing a file in Python

I want Python to read to the EOF so I can get an appropriate hash, whether it is sha1 or md5. Please help. Here is what I have so far:
import hashlib

inputFile = raw_input("Enter the name of the file:")
openedFile = open(inputFile)
readFile = openedFile.read()

md5Hash = hashlib.md5(readFile)
md5Hashed = md5Hash.hexdigest()

sha1Hash = hashlib.sha1(readFile)
sha1Hashed = sha1Hash.hexdigest()

print "File Name: %s" % inputFile
print "MD5: %r" % md5Hashed
print "SHA1: %r" % sha1Hashed
TL;DR: use buffers so you don't use tons of memory.
We get to the crux of your problem, I believe, when we consider the memory implications of working with very large files. We don't want this bad boy to churn through 2 gigs of RAM for a 2 gigabyte file, so, as pasztorpisti points out, we've got to deal with those bigger files in chunks!
import sys
import hashlib

# BUF_SIZE is totally arbitrary, change for your app!
BUF_SIZE = 65536  # lets read stuff in 64kb chunks!

md5 = hashlib.md5()
sha1 = hashlib.sha1()

with open(sys.argv[1], 'rb') as f:
    while True:
        data = f.read(BUF_SIZE)
        if not data:
            break
        md5.update(data)
        sha1.update(data)

print("MD5: {0}".format(md5.hexdigest()))
print("SHA1: {0}".format(sha1.hexdigest()))
What we've done is we're updating our hashes of this bad boy in 64kb chunks as we go along with hashlib's handy dandy update method. This way we use a lot less memory than the 2gb it would take to hash the guy all at once!
You can test this with:
$ mkfile 2g bigfile
$ python hashes.py bigfile
MD5: a981130cf2b7e09f4686dc273cf7187e
SHA1: 91d50642dd930e9542c39d36f0516d45f4e1af0d
$ md5 bigfile
MD5 (bigfile) = a981130cf2b7e09f4686dc273cf7187e
$ shasum bigfile
91d50642dd930e9542c39d36f0516d45f4e1af0d bigfile
All of this is also outlined in the linked question: Get MD5 hash of big files in Python
Addendum!
In general, when writing Python it helps to get into the habit of following PEP 8. For example, Python variables are typically underscore_separated, not camelCased. But that's just style, and no one really cares about those things except people who have to read bad style... which might be you, reading this code years from now.
For the correct and efficient computation of the hash value of a file (in Python 3):
Open the file in binary mode (i.e. add 'b' to the filemode) to avoid character encoding and line-ending conversion issues.
Don't read the complete file into memory, since that is a waste of memory. Instead, sequentially read it block by block and update the hash for each block.
Eliminate double buffering, i.e. don't use buffered IO, because we already use an optimal block size.
Use readinto() to avoid buffer churning.
Example:
import hashlib

def sha256sum(filename):
    h = hashlib.sha256()
    b = bytearray(128*1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        while n := f.readinto(mv):
            h.update(mv[:n])
    return h.hexdigest()
Note that the while loop uses an assignment expression which isn't available in Python versions older than 3.8.
With older Python 3 versions you can use an equivalent variation:
import hashlib

def sha256sum(filename):
    h = hashlib.sha256()
    b = bytearray(128*1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        for n in iter(lambda: f.readinto(mv), 0):
            h.update(mv[:n])
    return h.hexdigest()
I would propose simply:

import hashlib

def get_digest(file_path):
    h = hashlib.sha256()
    with open(file_path, 'rb') as file:
        while True:
            # Reading is buffered, so we can read smaller chunks.
            chunk = file.read(h.block_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()
All other answers here seem to complicate things too much. Python is already buffering when reading (in an ideal manner, or you can configure that buffering if you have more information about the underlying storage), so it is better to read in chunks of the size the hash function finds ideal, which makes it faster, or at least less CPU intensive, to compute the hash. So instead of disabling buffering and trying to emulate it yourself, use Python's buffering and control what you should be controlling: what the consumer of your data finds ideal, the hash block size.
Here is a Python 3, POSIX solution (not Windows!) that uses mmap to map the object into memory.
import hashlib
import mmap

def sha256sum(filename):
    h = hashlib.sha256()
    with open(filename, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as mm:
            h.update(mm)
    return h.hexdigest()
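If you want the same mmap approach to also work on Windows, passing the access argument instead of prot should be portable, since mmap.ACCESS_READ is supported on both POSIX and Windows:

import hashlib
import mmap

def sha256sum(filename):
    h = hashlib.sha256()
    with open(filename, 'rb') as f:
        # ACCESS_READ works on POSIX and Windows alike
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            h.update(mm)
    return h.hexdigest()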
I have programmed a module which is able to hash big files with different algorithms:
pip3 install py_essentials
Use the module like this:
from py_essentials import hashing as hs
hash = hs.fileChecksum("path/to/the/file.txt", "sha256")
You do not need to define a function with 5-20 lines of code to do this! Save your time by using the pathlib and hashlib libraries; py_essentials is another solution, but third-parties are *****.
from pathlib import Path
import hashlib

filepath = '/path/to/file'
filebytes = Path(filepath).read_bytes()

filehash_sha1 = hashlib.sha1(filebytes).hexdigest()
filehash_md5 = hashlib.md5(filebytes).hexdigest()

print(f'MD5: {filehash_md5}')
print(f'SHA1: {filehash_sha1}')
I used a few variables here to show the steps; you know how to avoid them.
What do you think about the function below?
from pathlib import Path
import hashlib

def compute_filehash(filepath: str, hashtype: str) -> str:
    """Computes the requested hash for the given file.

    Args:
        filepath: The path to the file to compute the hash for.
        hashtype: The hash type to compute.
            Available hash types:
                md5, sha1, sha224, sha256, sha384, sha512,
                sha3_224, sha3_256, sha3_384, sha3_512

    Returns:
        A string that represents the hash.

    Raises:
        ValueError: If the hash type is not supported.
    """
    # shake_128/shake_256 are omitted: their hexdigest() needs a length argument
    if hashtype not in ['md5', 'sha1', 'sha224', 'sha256', 'sha384',
                        'sha512', 'sha3_224', 'sha3_256', 'sha3_384',
                        'sha3_512']:
        raise ValueError(f'Hash type {hashtype} is not supported.')
    return getattr(hashlib, hashtype)(
        Path(filepath).read_bytes()).hexdigest()
FWIW, I prefer this version, which has the same memory and performance characteristics as maxschlepzig's answer but is more readable IMO:
import hashlib

def sha256sum(filename, bufsize=128 * 1024):
    h = hashlib.sha256()
    buffer = bytearray(bufsize)
    # using a memoryview so that we can slice the buffer without copying it
    buffer_view = memoryview(buffer)
    with open(filename, 'rb', buffering=0) as f:
        while True:
            n = f.readinto(buffer_view)
            if not n:
                break
            h.update(buffer_view[:n])
    return h.hexdigest()
Starting with Python 3.11, you can use hashlib.file_digest(), which takes responsibility for reading the file:

import hashlib

with open(inputFile, "rb") as f:
    digest = hashlib.file_digest(f, "sha256")
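The returned object is an ordinary hash object, so you read the result the same way as with the manual loops above:

print(digest.hexdigest())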
import hashlib

user = input("Enter ")
h = hashlib.md5(user.encode())
h2 = h.hexdigest()
with open("encrypted.txt", "w") as e:
    print(h2, file=e)
with open("encrypted.txt", "r") as e:
    p = e.readline().strip()
    print(p)
