Hash files in Python larger than available RAM? [duplicate]

This question already has answers here:
Hashing a file in Python
(9 answers)
Closed 10 months ago.
I am currently doing a project where I turn my Pi (Model 4 2GB) into a sort of NAS archive. I decided to learn a bit of Python along the way and wrote a small console app to "manage" my data base. One function I added was that it hashes the files in the database so it knows when files are corrupted.
To achieve this I hash a file like this:
from hashlib import sha256
with open(file, "rb") as f:
    rbytes = f.read()
    readable_hash = sha256(rbytes).hexdigest()
Now when I run this on smaller files it works just fine but on large files like videos it spits out a MemoryError - I presume this is because it doesn't have enough RAM to hold the file?
I've seen that you can break the read up into chunks but does this also work for hashing? If so, how?
Also I'm not a programmer. I want to learn in the process, so the simpler the solution the better - I want to actually understand the code I use. :) Doesn't need to be a super fast algorithm that squeezes out every millisecond either, as long as it gets the job done.
Thanks for any help in advance!

One solution is to read the file in chunks and hash each chunk together with the hash of everything read so far. The final hash still depends on the whole file; there are just a few extra steps.
import hashlib
def hash_file(filename, bytes):
    hashed = ""  # start with an empty string
    with open(filename, 'rb') as f:  # read from file
        while True:  # read the defined number of bytes until the loop is broken
            chunk = f.read(bytes)  # read bytes
            if chunk:  # as long as "chunk" is not empty
                # hash the newly read chunk together with the old hash
                hashed = str(hashlib.md5(str(chunk).encode() + hashed.encode()).digest())
            else:
                break  # stop the loop
    return hashed
print(hash_file('file.txt', 1000))
By hashing the contents over and over again, each new hash is derived from the previous hash plus the next chunk. Only one chunk and one hash are in memory at a time (MD5 hashes always have the same size), yet the final value still depends on the whole file. Note that the result is not the standard MD5 digest of the file, so it can only be compared with hashes produced by this same function and chunk size.
PS: the bytes variable can be anything, but more bytes means more memory while fewer bytes means longer compute time, so try what fits your needs. 1000–9000 seems to be a good spot.
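For comparison, hashlib objects can also be fed a file piece by piece through update(), which gives the same digest as hashing the whole file at once. A minimal sketch, where the function name sha256_of_file and the 64 KiB default chunk size are assumptions chosen for illustration:

```python
import hashlib

def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # an empty bytes object means end of file
                break
            digest.update(chunk)  # feed the next piece into the running hash
    return digest.hexdigest()
```

Only one chunk is held in memory at a time, so the file size no longer matters, and the result should match what a whole-file sha256(...).hexdigest() call would give.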

Related

How to store read-in data element by element in a memory-efficient way?

The program I'm working on needs to read in data files which can be quite large (up to 5 GB) in ASCII. The format can vary, which is why I came up with using readline(), splitting every line to get just the pure entries, appending them all to one big list of strings, dividing this list into smaller string lists depending on the occurrence of certain marker words, and then passing the data to a program-internal data structure for further unified processing.
This method works well enough, except that it needs way too much memory, and I wonder why.
So I wrote this little test case to illustrate my problem:
The input data here is the text of Shakespeare's Romeo and Juliet (actually I expect mixed alphabetic-numeric input). Note that I want you to copy the data yourself to keep things clear. The script generates a .txt file which is then read in again. The original memory size in this case is 153 KB.
Reading this file with...
f.read() gives you a single string with a size of 153 KB, too.
f.readlines() gives you a list with a single string for every line, with an overall size of 420 KB.
Splitting the line strings of f.readlines() at every whitespace and saving all those single entries in a new list results in 1619 KB of memory use.
While these numbers are not a problem in this case, a more than tenfold increase in RAM requirement definitely is one for input data on the order of gigabytes.
I don't have any idea why this is or how to avoid it. As I understand it, a list is just a structure of pointers to all the values stored in the list (this is also the reason why sys.getsizeof() on a list gives you a 'wrong' result).
For the values themselves it shouldn't make a difference in memory if I have "LONG STRING" or "LONG" + "STRING", as both use the same characters, which should result in the same amount of bits/bytes.
Maybe the answer is really simple but I am really stuck with this problem so I am thankful for every idea.
# step1: http://shakespeare.mit.edu/romeo_juliet/full.html
# step2: Ctrl+A and then Ctrl+C
# step3: Ctrl+V after benchmarkText
benchmarkText = """ >>INSERT ASCII DATA HERE<< """
#=== import modules =======================================
from pympler import asizeof
import sys
#=== open files and save data to a structure ==============
#-- original memory size
print("\n\nAll memory sizes are in KB:\n")
print("Original string size:")
print(asizeof.asizeof(benchmarkText)/1e3)
print(sys.getsizeof(benchmarkText)/1e3)
#--- write bench mark file
with open('benchMarkText.txt', 'w') as f:
    f.write(benchmarkText)
#--- read the whole file (should always be equal to original size)
with open('benchMarkText.txt', 'r') as f:
    # read the whole file as one string
    wholeFileString = f.read()
# check size:
print("\nSize using f.read():")
print(asizeof.asizeof(wholeFileString)/1e3)
#--- read the file in a list
listOfWordOrNumberStrings = []
with open('benchMarkText.txt', 'r') as f:
    # save every line of the file
    listOfLineStrings = f.readlines()
print("\nSize using f.readlines():")
print(asizeof.asizeof(listOfLineStrings)/1e3)
# split every line into the words or punctuation marks
for stringLine in listOfLineStrings:
    line = stringLine[:-1] # get rid of the '\n'
    # line = re.sub('"', '', line) # The final implementation will need this, however for the test case it doesn't matter.
    elemsInLine = line.split()
    for elem in elemsInLine:
        listOfWordOrNumberStrings.append(elem)
# check size
print("\nSize after splitting:")
print(asizeof.asizeof(listOfWordOrNumberStrings)/1e3)
(I am aware that I use readlines() instead of readline() here - I changed it for this test case because I think it makes things easier to understand.)
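The assumption that one long string and its split parts occupy the same memory can be checked directly with sys.getsizeof(). A minimal sketch, assuming CPython, where every str object carries a fixed header in addition to its characters and the list stores one pointer per element; the helper name split_overhead is made up for illustration:

```python
import sys

def split_overhead(text):
    """Return (size of the single string, total size of its split parts)."""
    whole = sys.getsizeof(text)
    parts = text.split()
    # Each part is a separate str object with its own per-object header,
    # and the list itself holds one pointer per part.
    pieces = sum(sys.getsizeof(p) for p in parts) + sys.getsizeof(parts)
    return whole, pieces

whole, pieces = split_overhead("LONG STRING " * 1000)
print(whole, pieces)
```

On CPython the second number comes out considerably larger than the first, which is consistent with the growth measured above; the exact sizes depend on the interpreter version and platform.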

Read N number of bytes from stdin of python and output to a temp file for further processing

I would like to read a fixed number of bytes from stdin of a python script and output it to one temporary file batch by batch for further processing. Therefore, when the first N number of bytes are passed to the temp file, I want it to execute the subsequent scripts and then read the next N bytes from stdin. I am not sure what to iterate over in the top loop before While true. This is an example of what I tried.
import sys
while True:
    data = sys.stdin.read(2330049) # Number of bytes I would like to read in one iteration
    if data == "":
        break
    file1=open('temp.fil','wb') #temp file
    file1.write(data)
    file1.close()
    # further_processing on temp.fil (I think this can only be done after file1 is closed)

Two quick suggestions:
1. You should pretty much never do while True.
2. Use Python 3.
Are you trying to read from a file? or from actual standard in? (Like the output of a script piped to this?)
Here is an answer I think will work for you, if you are reading from a file, that I pieced together from some other answers listed at the bottom:
with open("in-file", "rb") as in_file, open("out-file", "wb") as out_file:
    data = in_file.read(2330049)
    while data != b"":  # an empty bytes object means end of file
        out_file.write(data)
        data = in_file.read(2330049)  # read the next block
If you want to read from actual standard in, I would read all of it in, then split it up by bytes. The only way this won't work is if you are trying to deal with constant streaming data...which I would most definitely not use standard in for.
The .encode('UTF-8') and .decode('hex') methods might be of use to you also (note that .decode('hex') works only in Python 2; in Python 3 the counterpart is bytes.fromhex()).
Sources: https://stackoverflow.com/a/1035360/957648 & Python, how to read bytes from file and save it?
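For input that really does arrive on standard input, the batches can also be read from the binary stream one at a time rather than all at once. A minimal sketch in Python 3; the helper name iter_batches is hypothetical, and io.BytesIO stands in for sys.stdin.buffer so that the example runs without piped input:

```python
import io

def iter_batches(stream, batch_size):
    """Yield successive blocks of at most batch_size bytes from a binary stream."""
    while True:
        data = stream.read(batch_size)
        if not data:  # b"" signals end of input
            break
        yield data

# io.BytesIO stands in for sys.stdin.buffer in this demonstration.
demo_stream = io.BytesIO(b"abcdefghij")
batches = list(iter_batches(demo_stream, 4))
print(batches)
```

With real piped input, passing sys.stdin.buffer and 2330049 would give one batch per loop iteration, so each batch could be written to temp.fil, closed, and processed before the next one is read.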

Python - split files

I currently have a script that requests a file via a requests.post(). The server sends me two files in the same stream. The way I am processing this right now is to save it all as one file, open it again, split the file based on a regex string, save it as a new file, and delete the old one. The file is large enough that I have to stream=True in my requests.post() statement and write it in chunks.
I was hoping that maybe someone knows a better way to issue the post or work with the data coming back so that the files are stored correctly the first time? Or is this the best way to do it?
----Adding current code----
if not os.path.exists(output_path):
    os.makedirs(output_path)
memFile = requests.post(url, data=etree.tostring(etXML), headers=headers, stream=True)
outFile = open('output/tempfile', 'wb')
for chunk in memFile.iter_content(chunk_size=512):
    if chunk:
        outFile.write(chunk)
f = open('output/tempfile', 'rb').read().split('\r\n\r\n')
arf = open('output/recording.arf', 'wb')
arf.write(f[3])
os.remove('output/tempfile')

Okay, I was bored and wanted to figure out the best way to do this. It turns out that my initial way in the comments above was overly complicated (unless you consider a scenario where time is absolutely critical or memory is severely constrained). A buffer is a much simpler way to achieve this, so long as you take two or more blocks at a time. This code emulates the question's scenario for demonstration.
Note: depending on the regex engine implementation, this is more efficient and requires significantly fewer str/byte conversions, as using a regex requires casting each block of bytes to a string. The approach below requires no string conversions, instead operating solely on the bytes returned from requests.post(), and in turn writing those same bytes to file, without conversions.
from pprint import pprint
someString = '''I currently have a script that requests a file via a requests.post(). The server sends me two files in the same stream. The way I am processing this right now is to save it all as one file, open it again, split the file based on a regex string, save it as a new file, and delete the old one. The file is large enough that I have to stream=True in my requests.post() statement and write it in chunks.
I was hoping that maybe someone knows a better way to issue the post or work with the data coming back so that the files are stored correctly the first time? Or is this the best way to do it?'''
n = 16
# emulate a stream by creating 37 blocks of 16 bytes
byteBlocks = [bytearray(someString[i:i+n], 'ascii') for i in range(0, len(someString), n)]
pprint(byteBlocks)
# this string is present twice, but both times it is split across two bytearrays
matchBytes = bytearray(b'requests.post()')
# our buffer
buff = bytearray()
count = 0
for bb in byteBlocks:
    buff += bb
    count += 1
    # every two blocks
    if (count % 2) == 0:
        if count == 2:
            start = 0
        else:
            # step back so a match straddling the previous pair of blocks is still seen
            start = len(matchBytes) - 1
        # search only the bytes added since the last check (plus that overlap);
        # find() returns the position within the whole buffer, or -1
        pos = buff.find(matchBytes, ((count-2)*n)-start)
        if pos != -1:
            print('Match starting at index:', pos, 'ending at:', pos+len(matchBytes))
Update:
So, given the updated question, this code may remove the need to create a temporary file. I haven't been able to test it exactly, as I don't have a similar response, but you should be able to figure out any bugs yourself.
Since you aren't actually working with a stream directly, i.e. you're given the finished response object from requests.post(), you don't have to worry about using chunks in the networking sense. The "chunks" that requests refers to are really its way of dishing out the bytes, all of which it already has. You can access the bytes directly using r.raw.read(n), but as far as I can tell, the response object doesn't allow you to see how many bytes there are in "r.raw", so you're more or less forced to use the "iter_content" method.
Anyway, this code should copy all the bytes from the response object into one bytes string, which you can then search and split as before.
memFile = requests.post(url, data=etree.tostring(etXML), headers=headers, stream=True)
match = b'\r\n\r\n'
data = b''
for chunk in memFile.iter_content(chunk_size=512):
    if chunk:
        data += chunk
f = data.split(match)
# no temporary file is created, so there is nothing to remove afterwards
with open('output/recording.arf', 'wb') as arf:
    arf.write(f[3])
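If holding the whole response in memory is a concern, the splitting can also be done while the chunks arrive. A minimal sketch, assuming the chunks come from something like memFile.iter_content(chunk_size=512); the generator name split_stream is hypothetical, and a plain list of byte strings stands in for the response here:

```python
def split_stream(chunks, delimiter):
    """Yield the delimiter-separated parts of a byte stream given as chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Everything before the last delimiter seen so far is complete;
        # the final piece may still continue in the next chunk.
        *complete, buffer = buffer.split(delimiter)
        for part in complete:
            yield part
    # Whatever remains after the final delimiter is the last part.
    yield buffer

# The delimiter is deliberately split across chunk boundaries.
demo_chunks = [b"a\r\n", b"\r\nb", b"c\r\n\r", b"\nd"]
parts = list(split_stream(demo_chunks, b"\r\n\r\n"))
print(parts)
```

Only the part currently being assembled is kept in the buffer, so a delimiter that straddles two chunks is still recognized once the second chunk has been appended.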

Is there a straightforward way to write to a file open in r+ mode without overwriting existing bytes?

I have a text file test.txt, with the following contents:
Thing 1. string
And I'm creating a python file that will increment the number every time it gets run without affecting the rest of the string, like so.
Run once:
Thing 2. string
Run twice:
Thing 3. string
Run three times:
Thing 4. string
Run four times:
Thing 5. string
This is the code that I'm using to accomplish this.
file = open("test.txt","r+")
started = False
beginning = 0 #start of the digits
done = False
num = 0
#building the number from digits
while not done:
    next = file.read(1)
    if ord(next) in range(48, 58): #ascii values of 0-9
        started = True
        num *= 10
        num += int(next)
    elif started: #has reached the end of the number
        done = True
    else: #has not reached the beginning of the number
        beginning += 1
num += 1
file.seek(beginning,0)
file.write(str(num))
This code works, so long as the number is not 10^n-1 (9, 99, 999, etc.), because in those cases it writes more bytes than were previously in the number. As such, it will overwrite the characters that follow.
So this brings me to the point. I need a way to write to the file that overwrites previously existing bytes, which I have, and a way to write to the file that does not overwrite previously existing bytes, which I don't have. Does such a mechanism exist in Python, and if so, what is it?
I have already tried opening the file using the line file = open("test.txt","a+") instead. When I do that, it always writes to the end, regardless of the seek point.
file = open("test.txt","w+") will not work because I need to keep the contents of the file while altering it, and files opened in any variant of w mode are wiped clean.
I have also thought of solving my problem using a function like this:
#file is assumed to be in r+ mode
def write(string, file, index = -1):
    if index == -1:
        index = file.tell() #default: insert at the current position
    file.seek(index, 0)
    remainder = file.read()
    file.seek(index)
    file.write(string + remainder) #new text first, then the old tail
But I also want to be able to expand the solution to larger files, and reading the rest of the file single-handedly changes what I'm trying to accomplish from being O(1) to O(n). It also seems very non-Pythonic, since it seeks to accomplish the task in a less-than-straightforward way.
It would also make my I/O operations inconsistent: I would have class methods (file.read() and file.write()) to read from the file and write to it replacing old characters, but an external function to insert without replacing.
If I make the code inline, rather than a function, it means I have to write several of the same lines of code every time I try to write without replacing, which is also non-Pythonic.
To reiterate my question, is there a more straightforward way to do this, or am I stuck with the function?

Unfortunately, what you want to do is not possible. This is a limitation at a lower level than Python, in the operating system. Neither the Unix nor the Windows file access API offers any way to insert new bytes in the middle of a file without overwriting the bytes that were already there.
Reading the rest of the file and rewriting it is the usual workaround. Actually, the usual workaround is to rewrite the entire file under a new name and then use rename to move it back to the old name. On Unix, this accomplishes an atomic file update - unless the computer crashes, concurrent readers will see either the new file or the old file, not some hybrid. (Windows, sadly, still does not allow you to rename over a name that already exists, so if you use this strategy you have to delete the old file first, opening an unavoidable race window where the file might appear not to exist at all.)
Yes, this is O(N), and yes, if you use the write-new-file-and-rename strategy it temporarily consumes scratch disk space equal to the size of the file (old or new, whichever is larger). That's just how it is.
I haven't thought about it enough to give you even a sketch of the code, but it should be possible to use context managers to wrap up the write-new-file-and-rename approach tidily.
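One way the write-new-file-and-rename approach could be wrapped in a context manager is sketched below. The names rewrite_file and increment_first_number are hypothetical, and the sketch relies on os.replace(), which the Python documentation describes as replacing an existing destination file on both Unix and Windows:

```python
import os
import re
import tempfile
from contextlib import contextmanager

@contextmanager
def rewrite_file(path):
    """Yield (old, new) text files; on success, new replaces the file at path."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the scratch file next to the original so the rename stays
    # on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as new, open(path, "r") as old:
            yield old, new
        os.replace(tmp_path, path)  # swap the new file into place
    except BaseException:
        os.unlink(tmp_path)  # discard the scratch file on any failure
        raise

def increment_first_number(path):
    """Add 1 to the first run of digits in the file, whatever its length."""
    with rewrite_file(path) as (old, new):
        text = old.read()
        new.write(re.sub(r"\d+", lambda m: str(int(m.group()) + 1), text, count=1))

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "test.txt")
    with open(demo_path, "w") as f:
        f.write("Thing 9. string")
    increment_first_number(demo_path)
    with open(demo_path) as f:
        result = f.read()
print(result)
```

The demonstration reads the whole file into memory for brevity; for large files the body of the with block could copy the old file to the new one in chunks instead. The replacement file also gets the permissions of the scratch file, which may differ from those of the original.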

No, the disk doesn't work like you think it does.
You have to remember that your file is stored on disk as one contiguous
chunk of data*
Your disk happens to be wound up in a great big spool, a bit like a record,
but if you were to unwind your file, you'd get something that looks like
this:
+------------------------------------------------------------+
|                      Thing 1. String                       |
+------------------------------------------------------------+
^                      ^             ^                       ^
|                      |             |                       |
|                Start of file  End of file                  |
Start of disk                                      End of disk
As you've discovered, there's no way to simply insert data in the middle.
Generally speaking, that wouldn't be possible at all, without physically
altering your disk. And who wants to do that? Especially when just flipping
the magnetic bits on your disk is so much easier and faster. In order to
do what you want to do, you have to read the bytes that you want to
overwrite, then start writing down your new ones. It might look something
like this:
Open the file
Seek to the point of insert
Read the current byte
Seek backward one byte
Write down the first byte of the new string
Read the next byte
Seek backward one byte
Write down the next byte of the new string
Repeat until all the bytes have been written to disk
close the file
Of course, this might be a little bit on the slow side, due to all the
seeking back & forth in the file. It might be faster to read each line,
and then seek back to the previous location in the file. It should be
relatively straightforward to implement something like this in Python,
but as you've discovered, there are system limitations that Python can't
really overcome.
*Unless the files are fragmented, but we're living in an ideal
world where gravity adheres to 9.8 m/s² and the Earth is a perfect
sphere.
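The read-a-byte, seek-back, write-a-byte steps listed above can be sketched as follows. This is only an illustration under the assumption that the file is opened in binary "r+b" mode; the function name insert_bytes is made up, and the demonstration uses a throwaway file:

```python
import os
import tempfile

def insert_bytes(path, offset, new_bytes):
    """Insert new_bytes at offset by shifting the tail forward one byte at a time."""
    with open(path, "r+b") as f:
        pending = bytearray(new_bytes)  # bytes still waiting to be written
        pos = offset
        while pending:
            f.seek(pos)
            displaced = f.read(1)   # byte about to be overwritten (b"" at end of file)
            f.seek(pos)             # seek backward one byte
            f.write(pending[:1])    # write the next pending byte in its place
            del pending[:1]
            pending += displaced    # the displaced byte has to be rewritten later
            pos += 1

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "test.txt")
    with open(demo_path, "wb") as f:
        f.write(b"Thing 1. string")
    insert_bytes(demo_path, 7, b"0")
    with open(demo_path, "rb") as f:
        result = f.read()
print(result)
```

Every byte after the insertion point is still read and rewritten once, so the cost remains proportional to the size of the rest of the file.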

how to get tell() to work

I'm trying to open a file and read from the last point read. My files are rather big (20 Mb to ~ 1 Gb) After doing some research it seems that tell() and seek() would be one of the most efficient ways to perform this. I've tried the following code
opened = open(filename, "rU")
f1 = csv.reader(opened)
k = []
for line in f1:
    k.append(opened.tell())
When I do this every value in the list is 8272 Long. Does that mean that I cannot use this implementation? Is there something I'm missing? Thanks for your help!
I'm running python 2.7 in Windows 7
Update
After piecing together everything learned here and trial and error I get the following code
opened = open(filename, "rU")
k = [0]
where = 1
for switch in opened:
    where += len(switch) + 1
    f = StringIO.StringIO(switch)
    interesting = csv.reader(f, delimiter=',')
    good_values = interesting.next()
    k.append(where)
return k
This allows the user to know exactly where in the file to go to while still being able to parse it according to its format. I'm not completely sure of why the offsets need to be constantly added (It seems that the newline is not accurately accounted for in len()).

It looks like csv.reader is reading the file in chunks of 8272 bytes, which is why you see this number returned from opened.tell() many times: until, I guess, you have read all the lines from your file in the range 0-8272. After that you will see 8272*2 a few times; the exact number will depend on the length of the lines in the buffer that was read.
So, basically, in your program, tell() doesn't give you the offsets of new CSV lines, as you seem to assume. It only tells you the offset of the end of the region of the file that has currently been read into an internal buffer used by the system functions that implement Python's I/O functions.
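One way to get per-line offsets that do not depend on read-ahead is to open the file in binary mode and call tell() before each readline(). A minimal sketch in Python 3 (not the Python 2.7 used in the question); the helper name line_offsets is hypothetical, and it assumes no CSV field contains an embedded newline:

```python
import os
import tempfile

def line_offsets(path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()       # position before the line is read
            line = f.readline()
            if not line:         # b"" means end of file
                break
            offsets.append(pos)
    return offsets

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "wb") as f:
        f.write(b"a,b\r\nc,d\r\nlast\n")
    result = line_offsets(demo_path)
print(result)
```

Because the offsets are counted in bytes, a later seek() to one of them in binary mode lands at the start of that line, and Windows line endings are counted with their real two-byte length.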
