reading large file uses 100% memory and my whole pc freezes - python

I have made a simple app that encrypts and decrypts files, but when I load a large file, like a 2 GB one, my program uses 100% of the memory. I use multiprocessing and multithreading.
poolSize = min(cpu_count(), len(fileList))
process_pool = Pool(poolSize)
thread_pool = ThreadPool(len(fileList))
lock = Lock()
worker = partial(encfile, process_pool, lock)
thread_pool.map(worker, fileList)
def encfile(process_pool, lock, file):
    with open(file, 'rb') as original_file:
        original = original_file.read()
    encrypted = process_pool.apply(encryptfn, args=(key, original,))
    with open(file, 'wb') as encrypted_file:
        encrypted_file.write(encrypted)

This is my general idea:
Since memory is a problem, you have to read the file in smaller chunks, say 64K pieces, encrypt each 64K block, and write those out. Of course, each encrypted block will have a length other than 64K, so the problem becomes how to decrypt. So each encrypted block must be prefixed with a fixed-length header that is nothing more than the length of the following encrypted block encoded as a 4-byte unsigned integer (which should be way more than adequate). The decryption loop first reads the next 4-byte length and then knows from that how many bytes long the encrypted block that follows is.
By the way, there is no need to pass a lock to encfile if you are not using it to, for example, count the files processed.
from tempfile import mkstemp
from os import fdopen, replace
BLOCKSIZE = 64 * 1024
ENCRYPTED_HEADER_LENGTH = 4
def encfile(process_pool, lock, file):
    """
    Encrypt file in place.
    """
    fd, path = mkstemp()  # make a temporary file
    with open(file, 'rb') as original_file, \
            fdopen(fd, 'wb') as encrypted_file:
        while True:
            original = original_file.read(BLOCKSIZE)
            if not original:
                break
            encrypted = process_pool.apply(encryptfn, args=(key, original))
            l = len(encrypted)
            l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'big')
            encrypted_file.write(l_bytes)
            encrypted_file.write(encrypted)
    replace(path, file)
def decfile(file):
    """
    Decrypt file in place.
    """
    fd, path = mkstemp()  # make a temporary file
    with open(file, 'rb') as encrypted_file, \
            fdopen(fd, 'wb') as original_file:
        while True:
            l_bytes = encrypted_file.read(ENCRYPTED_HEADER_LENGTH)
            if not l_bytes:
                break
            l = int.from_bytes(l_bytes, 'big')
            encrypted = encrypted_file.read(l)
            decrypted = decryptfn(key, encrypted)
            original_file.write(decrypted)
    replace(path, file)
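(encfile and decfile assume that key, encryptfn and decryptfn already exist in your program. If you just want to exercise the chunked scheme end to end, a hypothetical stand-in using the cryptography package's Fernet recipe could look like this; any routine that maps a bytes block to an encrypted bytes block of possibly different length will do.)
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # keep this somewhere safe; it is needed to decrypt

def encryptfn(key, data):
    # returns a token whose length differs from len(data)
    return Fernet(key).encrypt(data)

def decryptfn(key, token):
    return Fernet(key).decrypt(token)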
Explanation
The larger the block size, the more memory is required (your original program read the entire file; this program will only read 64K at a time). But I am assuming that too small a block size results in too many calls to the encryption routine, which is done by multiprocessing and therefore carries overhead per call -- so it's a tradeoff. 64K was arbitrary; increase it by a lot if you have the memory. You can even try 1024 * 1024 (1M).
I attempted to explain this before, but let me elaborate:
So let's say that when you encrypt a 64K block, the encrypted size for that one particular 64K block ends up being 67,986 bytes (a different 64K block will in general encrypt to a different length unless its unencrypted value happened to be the same). If I just write out the data with no other information, I would need some way to know that, to decrypt the file, it is first necessary to read back 67,986 bytes of data and pass them to the decrypt method (with the correct key, of course), because you have to decrypt the precise result of what was encrypted, no fewer and no more bytes. In other words, you can't just read back the encrypted file in arbitrary chunks and pass those chunks to the decrypt method. So the only way to know how big each encrypted chunk is, is to prefix each chunk with a header that gives the length of the chunk that follows.
l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'big') takes the integer length stored in variable l and encodes it as a byte array of size ENCRYPTED_HEADER_LENGTH in "big endian" order, meaning that the bytes are arranged from high-order bytes to low-order bytes:
>>> ENCRYPTED_HEADER_LENGTH = 4
>>> l = 67986
>>> l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'big')
>>> l_bytes
b'\x00\x01\t\x92'
>>> l_bytes = l.to_bytes(ENCRYPTED_HEADER_LENGTH, 'little')
>>> l_bytes
b'\x92\t\x01\x00'
>>>
\t is the tab character with a value of \x09, so we would be writing out 0x00010992, the 4-byte big-endian hexadecimal value for 67986.
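On the decryption side, int.from_bytes (used in decfile above) reverses this; a quick round trip:
>>> int.from_bytes(b'\x00\x01\t\x92', 'big')
67986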

Related

Best way to replace files in place safely and efficiently?

I am trying to encrypt a file in place using the cryptography module, so I don't have to buffer the ciphertext of the file (which can be memory intensive) and then replace the original file with its encrypted version. So my solution is encrypting a chunk of plaintext and then replacing it with its ciphertext, 16 bytes at a time (AES-CTR mode). The problem is that the loop is an infinite loop.
So how do I fix this?
What other methods do you suggest?
What are the side effects of using the method below?
pointer = 0
with open(path, "r+b") as file:
    print("...ENCRYPTING")
    while file:
        file_data = file.read(16)
        pointer += 16
        ciphertext = aes_enc.update(file_data)
        file.seek(pointer - 16)
        file.write(ciphertext)
    print("...Complete...")
So how do I fix this?
As Cyril Jouve already mentions, check for if not file_data.
What other methods do you suggest?
What are the side effects of using the method below?
Reading in blocks of 16 bytes is relatively slow. I guess you have enough memory to read larger blocks like 4096 or 8192 bytes.
Unless you have very large files and limited disk space, I think there is no benefit in reading from and writing to the same file. In case of an error, if the OS has already written data to disk, you will have lost the original data and be left with an incomplete encrypted file of which you don't know which part is encrypted.
It's easier and safer to create a new encrypted file and then delete and rename it if there were no errors.
Encrypt to a new file, catch exceptions, check the existence and size of the encrypted file, and delete the source and rename the encrypted file only if all is OK.
import os

path = r'D:\test.dat'
input_path = path
encrypt_path = path + '_encrypt'

try:
    with open(input_path, "rb") as input_file:
        with open(encrypt_path, "wb") as encrypt_file:
            print("...ENCRYPTING")
            while True:
                file_data = input_file.read(4096)
                if not file_data:
                    break
                ciphertext = aes_enc.update(file_data)
                encrypt_file.write(ciphertext)
            print("...Complete...")

    if os.path.exists(encrypt_path):
        if os.path.getsize(input_path) == os.path.getsize(encrypt_path):
            print(f'Deleting {input_path}')
            os.remove(input_path)
            print(f'Renaming {encrypt_path} to {input_path}')
            os.rename(encrypt_path, input_path)
except Exception as e:
    print(f'EXCEPTION: {str(e)}')
there is no "truthiness" for a file object, so you can't use it as the condition for your loop.
The file is at EOF when read() returns an empty bytes object (https://docs.python.org/3/library/io.html#io.BufferedIOBase.read)
import os

with open(path, "r+b") as file:
    print("...ENCRYPTING")
    while True:
        file_data = file.read(16)
        if not file_data:
            break
        ciphertext = aes_enc.update(file_data)
        file.seek(-len(file_data), os.SEEK_CUR)
        file.write(ciphertext)
    print("...Complete...")

Ungzipping chunks of bytes from S3 using iter_chunks()

I am encountering issues ungzipping chunks of bytes that I am reading from S3 using the iter_chunks() method from boto3. The strategy of ungzipping the file chunk-by-chunk originates from this issue.
The code is as follows:
import zlib

dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
for chunk in app.s3_client.get_object(Bucket=bucket, Key=key)["Body"].iter_chunks(2 ** 19):
    data = dec.decompress(chunk)
    print(len(chunk), len(data))
    # 524288 65505
    # 524288 0
    # 524288 0
    # ...
This code initially prints out the value 65505, followed by 0 for every subsequent iteration. My understanding is that this code should ungzip each compressed chunk and then print the length of the uncompressed version.
Is there something I'm missing?
It seems like your input file is block gzip (bgzip http://www.htslib.org/doc/bgzip.html ) because you have a 65k block of data decoded.
GZip files can be concatenated together ( see https://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage) and Block GZip uses this to concatenate blocks of the same file, so that by using an associated index only the specific block containing information of interest has to be decoded.
So to stream decode a block gzip file, you need to use the leftover data from one block to start a new one. E.g.
import zlib

# source is a block gzip file, see http://www.htslib.org/doc/bgzip.html
dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
for chunk in raw:
    # decompress this chunk of data
    data = dec.decompress(chunk)
    # bgzip is a concatenation of gzip files;
    # if there is stuff in this chunk beyond the current block,
    # it needs to be processed
    while len(dec.unused_data):
        # end of one block
        leftovers = dec.unused_data
        # create a new decompressor
        dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
        # decompress the leftovers
        data = data + dec.decompress(leftovers)
    # TODO handle data
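In the question's setup, raw could simply be the chunk iterator from boto3, reusing the question's own call:
raw = app.s3_client.get_object(Bucket=bucket, Key=key)["Body"].iter_chunks(2 ** 19)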

python encrypt big file

This script is an XOR encryption function. It works fine for small files, but when I try to encrypt a big file (about 5 GB) I get the error:
"OverflowError: size does not fit in an int"
and opening the file is too slow.
Can anyone help me optimize my script? Thank you.
from Crypto.Cipher import XOR
import base64
import os

def encrypt():
    enpath = "D:\\Software"
    key = 'vinson'
    for files in os.listdir(enpath):
        os.chdir(enpath)
        with open(files, 'rb') as r:
            print("open success", files)
            data = r.read()
            print("loading success", files)
            r.close()
        cipher = XOR.new(key)
        encoding = base64.b64encode(cipher.encrypt(data))
        with open(files, 'wb+') as n:
            n.write(encoding)
            n.close()
To expand upon my comment: you don't want to read the file into memory all at once, but process it in smaller blocks.
With any production-grade cipher (which XOR is definitely not) you would need to also deal with padding the output file if the source data is not a multiple of the cipher's block size. This script does not deal with that, hence the assertion about the block size.
Also, we're no longer irreversibly (well, aside from the fact that the XOR cipher is actually directly reversible) overwriting files with their encrypted versions. (Should you want to do that, it'd be better to just add code to remove the original, then rename the encrypted file into its place. That way you won't end up with a half-written, half-encrypted file.)
Also, I removed the useless Base64 encoding.
But – don't use this code for anything serious. Please don't. Friends don't let friends roll their own crypto.
from Crypto.Cipher import XOR
import os

def encrypt_file(cipher, source_file, dest_file):
    # this toy script is unable to deal with padding issues,
    # so we must have a cipher that doesn't require it:
    assert cipher.block_size == 1
    while True:
        src_data = source_file.read(1048576)  # 1 megabyte at a time
        if not src_data:  # ran out of data?
            break
        encrypted_data = cipher.encrypt(src_data)
        dest_file.write(encrypted_data)

def insecurely_encrypt_directory(enpath, key):
    for filename in os.listdir(enpath):
        file_path = os.path.join(enpath, filename)
        dest_path = file_path + ".encrypted"
        with open(file_path, "rb") as source_file, open(dest_path, "wb") as dest_file:
            cipher = XOR.new(key)
            encrypt_file(cipher, source_file, dest_file)
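A hypothetical invocation, reusing the directory and key from the original question purely for illustration (key passed as bytes):
insecurely_encrypt_directory("D:\\Software", b"vinson")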

How to encrypt multiple files in Python 2

I've been creating a data-protection program which encrypts all files on a computer using SHA-256. So far, the program is capable of encrypting one specified file (that has been hard-coded into the program) at a time and appending a .enc extension. The only problem here is that the program creates a new file after the encryption instead of saving over the original. So if I encrypt mypass.txt, I will now have mypass.txt as well as mypass.enc, but I need it to convert mypass.txt into mypass.enc. Additionally, if anyone has any idea as to how to encrypt all files as opposed to just one that is hard-coded I would be extremely thankful.
Thanks so much to anyone who has any input, please let me know if you need any additional information.
import os, random, struct
from Crypto.Cipher import AES

def encrypt_file(key, in_filename, out_filename=None, chunksize=64*1024):
    if not out_filename:
        out_filename = in_filename + '.enc'
    iv = ''.join(chr(random.randint(0, 0xFF)) for i in range(16))
    encryptor = AES.new(key, AES.MODE_CBC, iv)
    filesize = os.path.getsize(in_filename)
    with open(in_filename, 'rb') as infile:
        with open(out_filename, 'wb') as outfile:
            outfile.write(struct.pack('<Q', filesize))
            outfile.write(iv)
            while True:
                chunk = infile.read(chunksize)
                if len(chunk) == 0:
                    break
                elif len(chunk) % 16 != 0:
                    chunk += ' ' * (16 - len(chunk) % 16)
                outfile.write(encryptor.encrypt(chunk))
I'm assuming that you want to remove the contents of the original file as thoroughly as possible.
After creating the encrypted file, you could overwrite the original file with 0 bytes and then delete it (a rough sketch follows below).
Note: this is for an HDD. SSD drives can and will use a different memory block when overwriting a file for the purpose of wear levelling, so overwriting with 0 bytes is not useful on an SSD. For SSDs you should make sure that TRIM is enabled. (How that is done depends on the OS and filesystem used.) The thing is that only the SSD's controller determines when it will re-use a block of memory, obliterating the old contents, so on an SSD you cannot really be sure that the file contents are gone.
For the reasons mentioned above, I think it is a better idea to use an encrypted filesystem for confidential data, rather than encrypting individual files. That way everything written to the physical device is encrypted.
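Here is a rough sketch of the overwrite-then-delete idea, subject to the HDD caveat above (shred_and_remove is an invented name, not a library function):
import os

BLOCK = 64 * 1024

def shred_and_remove(path):
    # overwrite the file with zero bytes, block by block, then delete it
    # (of limited value on SSDs because of wear levelling, see above)
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        while size > 0:
            f.write(b"\x00" * min(BLOCK, size))
            size -= BLOCK
        f.flush()
        os.fsync(f.fileno())
    os.remove(path)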
As for encrypting multiple files, you have several options.
Give the names of the files to be encrypted on the command line. They can be retrieved in your script as sys.argv[1:].
Use os.walk to recursively retrieve the paths of all files under the current working directory and encrypt them.
A combination of the two: if a path in sys.argv[1:] is a file (test with os.path.isfile), encrypt it; if it is a directory (test with os.path.isdir), use os.walk to find all files in that directory and encrypt them (see the sketch below).
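A minimal sketch of that combined approach, assuming encrypt_file and key from the question's code are in scope:
import os
import sys

def encrypt_path(key, path):
    # encrypt a single file, or every file under a directory
    if os.path.isfile(path):
        encrypt_file(key, path)
    elif os.path.isdir(path):
        for dirpath, dirnames, filenames in os.walk(path):
            for name in filenames:
                encrypt_file(key, os.path.join(dirpath, name))

for path in sys.argv[1:]:
    encrypt_path(key, path)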

How to determine the Content-Length of a gzipped file in Python?

I have a big compressed file and I want to know the size of its content without uncompressing it. I've tried this:
import gzip
import os

with gzip.open(data_file) as f:
    f.seek(0, os.SEEK_END)
    size = f.tell()
but I get this error
ValueError: Seek from end not supported
How can I do that?
Thx.
It is not possible in principle to definitively determine the size of the uncompressed data in a gzip file without decompressing it. You do not need to have the space to store the uncompressed data -- you can discard it as you go along. But you have to decompress it all.
If you control the source of the gzip file and can assure that a) there are no concatenated members in the gzip file, b) the uncompressed data is less than 4 GB in length, and c) there is no extraneous junk at the end of the gzip file, then and only then you can read the last four bytes of the gzip file to get a little-endian integer that has the length of the uncompressed data.
See this answer for more details.
Here is Python code to read a gzip file and print the uncompressed length, without having to store or save the uncompressed data. It limits the memory usage to small buffers. This requires Python 3.3 or greater:
#!/usr/local/bin/python3.4
import sys
import zlib
import warnings
f = open(sys.argv[1], "rb")
total = 0
buf = f.read(1024)
while True: # loop through concatenated gzip streams
z = zlib.decompressobj(15+16)
while True: # loop through one gzip stream
while True: # go through all output from one input buffer
total += len(z.decompress(buf, 4096))
buf = z.unconsumed_tail
if buf == b"":
break
if z.eof:
break # end of a gzip stream found
buf = f.read(1024)
if buf == b"":
warnings.warn("incomplete gzip stream")
break
buf = z.unused_data
z = None
if buf == b"":
buf == f.read(1024)
if buf == b"":
break
print(total)
Unfortunately, the Python 2.x gzip module doesn't appear to support any way of determining uncompressed file size.
However, gzip does store the uncompressed file size as a little-endian 32-bit unsigned integer at the very end of the file: http://www.abeel.be/content/determine-uncompressed-size-gzip-file
Unfortunately, this only works for files under 4 GB in size, because the gzip format stores the length in only a 32-bit integer; see the manual.
import os
import struct

with open(data_file, "rb") as f:
    f.seek(-4, os.SEEK_END)
    size, = struct.unpack("<I", f.read(4))
print size
To summarize, I need to open huge compressed files (> 4 GB), so Dan's technique won't work, and I want the length (number of lines) of the file, so Mark Adler's technique is not appropriate.
Eventually, I found a solution for uncompressed files (not the most optimized, but it works!) which can be transposed easily to compressed files:
size = 0
with gzip.open(data_file) as f:
    for line in f:
        size += 1
return size
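If the line count is all you need, reading fixed-size binary chunks and counting newline characters is usually faster than iterating line by line; a sketch (note it does not count a final line without a trailing newline):
import gzip

def count_lines(data_file, blocksize=1024 * 1024):
    size = 0
    with gzip.open(data_file, 'rb') as f:
        while True:
            chunk = f.read(blocksize)
            if not chunk:
                break
            size += chunk.count(b'\n')
    return size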
Thank you all, people in this forum are very effective!
