I am writing a module that needs to deal with a large number of zip files pretty fast. As such, I was going to use something implemented in C rather than Python (from which I'll be calling the extractor). To test which method would be fastest, I wrote a script comparing Linux's 'unzip' command with the czipfile Python module (a wrapper around a C zip extractor). As a control, I used the native Python zipfile module.
The script creates a zip file of around 100 MB out of 100 ~1 MB files. It looks at three scenarios: A) the files are just random bytestrings; B) the files are random hex characters; C) the files are uniformly random sentences with line breaks.
In all cases, the performance of zipfile (implemented in Python) was on par with or significantly better than the two extractors implemented in C.
Any ideas why this could be happening? The script is attached; it requires czipfile and the 'unzip' command to be available in the shell.
from datetime import datetime
import zipfile
import czipfile
import os, binascii, random
class ZipTestError(Exception):
pass
class ZipTest:
procs = ['zipfile', 'czipfile', 'os']
type_map = {'r':'Random', 'h':'Random Hex', 's':'Sentences'}
# three types. t=='r' is random noise files directly out of urandom. t=='h' is urandom noise converted to ascii characters. t=='s' are randomly constructed sentences with line breaks.
def __init__(self):
print """Testing Random Byte Files:
"""
self.test('r')
self.test('h')
self.test('s')
    @staticmethod
def rand_name():
return binascii.b2a_hex(os.urandom(10))
def make_file(self, t):
f_name = self.rand_name()
f = open(f_name, 'w')
if t == 'r':
f.write(os.urandom(1048576))
elif t == 'h':
f.write(binascii.b2a_hex(os.urandom(1048576)))
elif t == 's':
for i in range(76260):
ops = ['dog', 'cat', 'rat']
ops2 = ['meat', 'wood', 'fish']
n1 = int(random.random()*10) % 3
n2 = int(random.random()*10) % 3
sentence = """The {0} eats {1}
""".format(ops[n1], ops2[n2])
f.write(sentence)
else:
raise ZipTestError('Invalid Type')
f.close()
return f_name
#create a ~100MB zip file to test extraction on.
def create_zip_test(self, t):
self.file_names = []
self.output_names = []
for i in range(100):
self.file_names.append(self.make_file(t))
self.zip_name = self.rand_name()
output = zipfile.ZipFile(self.zip_name, 'w', zipfile.ZIP_DEFLATED)
for f in self.file_names:
output.write(f)
output.close()
def clean_up(self, rem_zip = False):
for f in self.file_names:
os.remove(f)
self.file_names = []
for f in self.output_names:
os.remove(f)
self.output_names = []
if rem_zip:
if getattr(self, 'zip_name', False):
os.remove(self.zip_name)
self.zip_name = False
def display_res(self, res, t):
print """
{0} results:
""".format(self.type_map[t])
for p in self.procs:
print"""
{0} = {1} milliseconds""".format(p, str(res[p]))
def test(self, t):
self.create_zip_test(t)
res = self.unzip()
self.display_res(res, t)
self.clean_up(rem_zip = True)
def unzip(self):
res = dict()
for p in self.procs:
self.clean_up()
res[p] = getattr(self, "unzip_with_{0}".format(p))()
return res
def unzip_with_zipfile(self):
return self.unzip_with_python(zipfile)
def unzip_with_czipfile(self):
return self.unzip_with_python(czipfile)
def unzip_with_python(self, mod):
f = open(self.zip_name)
zf = mod.ZipFile(f)
start = datetime.now()
op = './'
for name in zf.namelist():
zf.extract(name,op)
self.output_names.append(name)
end = datetime.now()
total = end-start
ms = total.microseconds
ms += (total.seconds) * 1000000
return ms /1000
def unzip_with_os(self):
f = open(self.zip_name)
start = datetime.now()
zf = zipfile.ZipFile(f)
for name in zf.namelist():
self.output_names.append(name)
os.system("unzip -qq {0}".format(f.name))
end = datetime.now()
total = end-start
ms = total.microseconds
ms += (total.seconds) * 1000000
return ms /1000
if __name__ == '__main__':
ZipTest()
As was pointed out above, only the decryption is implemented in Python; the decompression is not. So zipfile is using a C implementation (zlib) just like the other two.
Even if C is generally faster than interpreted languages, when the algorithm is the same, different buffering strategies can still make a difference. Here's some evidence:
I made a couple of changes to your script. The diff is below.
I made the stopwatch start just before os.system. This change is barely noticeable, since reading the entries off the Central Directory is quick. So I also saved the zip files and measured the unzip time with the time shell builtin, outside Python. The result shows that the overhead of firing up new processes doesn't matter much.
A more interesting change is the addition of libarchive. The results I obtained are like so (milliseconds):
Extraction method     Random     Hex    Sentences
zipfile                  368    1909          604
czipfile                 241    1600         2313
os                       707    2225          784
shell-measured           797    2272          737
libarchive               248    1513          451
Note that results vary by some milliseconds every time. The shell measures real, user, and sys time (see What do 'real', 'user' and 'sys' mean in the output of time(1)?). The figures above reflect real time, for consistency with other measurements.
A better analysis of what system calls unzip issues can be achieved by strace -c -w. It shows a spike of reads for Hex:
Syscall issued by unzip     Random      Hex    Sentences
read                           805    14597        12816
write                         2600     3200         1600
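The summary above can be reproduced with a helper along these lines; a minimal sketch (the strace_unzip name and the strace_summary.txt output file are mine, and it assumes strace, unzip and a saved test zip are available):
import subprocess

def strace_unzip(zip_name):
    # Run unzip under strace: -c collects a per-syscall summary,
    # -w uses wall-clock time in that summary, -o writes it to a file.
    subprocess.call(["strace", "-c", "-w", "-o", "strace_summary.txt",
                     "unzip", "-qq", "-o", zip_name])
    with open("strace_summary.txt") as f:
        print(f.read())

# strace_unzip("saved_test.zip")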
Now for the diff (it assumes the original script is named ziptest.py in the same directory where you run patch < _diff_, see patch, diff)
--- ziptest.py.orig 2017-05-25 10:36:03.106994889 +0200
+++ ziptest.py 2017-05-25 11:30:42.032598259 +0200
@@ -2,6 +2,7 @@
import zipfile
import czipfile
import os, binascii, random
+import libarchive.public
class ZipTestError(Exception):
pass
@@ -10,7 +11,7 @@
class ZipTest:
- procs = ['zipfile', 'czipfile', 'os']
+ procs = ['zipfile', 'czipfile', 'os', 'libarchive']
type_map = {'r':'Random', 'h':'Random Hex', 's':'Sentences'}
# three types. t=='r' is random noise files directly out of urandom. t=='h' is urandom noise converted to ascii characters. t=='s' are randomly constructed sentences with line breaks.
@@ -119,10 +120,10 @@
def unzip_with_os(self):
f = open(self.zip_name)
- start = datetime.now()
zf = zipfile.ZipFile(f)
for name in zf.namelist():
self.output_names.append(name)
+ start = datetime.now()
os.system("unzip -qq {0}".format(f.name))
end = datetime.now()
total = end-start
@@ -130,7 +131,15 @@
ms += (total.seconds) * 1000000
return ms /1000
-
+ def unzip_with_libarchive(self):
+ start = datetime.now()
+ for entry in libarchive.public.file_pour(self.zip_name):
+ self.output_names.append(str(entry))
+ end = datetime.now()
+ total = end-start
+ ms = total.microseconds
+ ms += (total.seconds) * 1000000
+ return ms /1000
Related
I'm a hobby coder who started with AHK, then some Java, and now I'm trying to learn Python. I have searched and found some tips, but I have not yet been able to implement them in my own code.
Hopefully someone here can help me; it's a very short program.
I'm using a .txt CSV database with ";" as the separator.
DATABASE EXAMPLE:
Which color is normally a cat?;Black
How tall was the longest man on earth?;272 cm
Is the earth round?;Yes
The database now consists of 20,000 lines, which makes the program too slow, using only 25% CPU (1 core).
If I can make it use all 4 cores (100%), I guess it would perform the task a lot faster. The task is basically to compare the clipboard contents with the database and, if there is a match, return an answer. Perhaps I could also separate the database into 4 pieces?
The code right now looks like this! No more than 65 lines, and it's doing its job (but too slow). Advice on how I can make this process multi-core is needed.
import time
import pyperclip as pp
import pandas as pd
import pymsgbox as pmb
from fuzzywuzzy import fuzz
import numpy
ratio_threshold = 90
fall_back_time = 1
db_file_path = 'database.txt'
db_separator = ';'
db_encoding = 'latin-1'
def load_db():
while True:
try:
# Read and create database
db = pd.read_csv(db_file_path, sep=db_separator, encoding=db_encoding)
db = db.drop_duplicates()
return db
except:
print("Error in load_db(). Will sleep for %i seconds..." % fall_back_time)
time.sleep(fall_back_time)
def top_answers(db, question):
db['ratio'] = db['question'].apply(lambda q: fuzz.ratio(q, question))
db_sorted = db.sort_values(by='ratio', ascending=False)
db_sorted = db_sorted[db_sorted['ratio'] >= ratio_threshold]
return db_sorted
def write_txt(top):
result = top.apply(lambda row: "%s" % (row['answer']), axis=1).tolist()
result = '\n'.join(result)
fileHandle = open("svar.txt", "w")
fileHandle.write(result)
fileHandle.close()
pp.copy("")
def main():
try:
db = load_db()
last_db_reload = time.time()
while True:
# Get contents of clipboard
question = pp.paste()
# Rank answer
top = top_answers(db, question)
# If answer was found, show results
if len(top) > 0:
write_txt(top)
time.sleep(fall_back_time)
except:
print("Error in main(). Will sleep for %i seconds..." % fall_back_time)
time.sleep(fall_back_time)
if __name__ == '__main__':
    main()
If you could divide the db into four equally large parts, you could process them in parallel like this:
import time
import pyperclip as pp
import pandas as pd
import pymsgbox as pmb
from fuzzywuzzy import fuzz
import numpy
import threading
ratio_threshold = 90
fall_back_time = 1
db_file_path = 'database.txt'
db_separator = ';'
db_encoding = 'latin-1'
def worker(thread_id, question):
thread_id = str(thread_id)
db = pd.read_csv(db_file_path + thread_id, sep=db_separator, encoding=db_encoding)
db = db.drop_duplicates()
db['ratio'] = db['question'].apply(lambda q: fuzz.ratio(q, question))
db_sorted = db.sort_values(by='ratio', ascending=False)
db_sorted = db_sorted[db_sorted['ratio'] >= ratio_threshold]
top = db_sorted
result = top.apply(lambda row: "%s" % (row['answer']), axis=1).tolist()
result = '\n'.join(result)
fileHandle = open("svar" + thread_id + ".txt", "w")
fileHandle.write(result)
fileHandle.close()
pp.copy("")
return
def main():
question = pp.paste()
    threads = []
    # one thread per database chunk: database.txt1 .. database.txt4
    for i in range(1, 5):
        t = threading.Thread(target=worker, args=(i, question))
        t.start()
        threads.append(t)
    # join only after all threads have been started, so they run concurrently
    for t in threads:
        t.join()

if __name__ == '__main__':
    main()
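The worker above expects the database to already be split into files named database.txt1 through database.txt4. Here is a minimal sketch of how those could be produced (the split_database helper is mine; it assumes the same two-column question;answer layout as the original file):
import pandas as pd

def split_database(src='database.txt', parts=4, sep=';', encoding='latin-1'):
    # Split the CSV database into `parts` roughly equal files named
    # database.txt1 .. database.txt4, which the threaded workers read.
    db = pd.read_csv(src, sep=sep, encoding=encoding)
    chunk_size = -(-len(db) // parts)  # ceiling division
    for i in range(parts):
        chunk = db.iloc[i * chunk_size:(i + 1) * chunk_size]
        chunk.to_csv(src + str(i + 1), sep=sep, encoding=encoding, index=False)

# split_database()  # run once before starting the threaded version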
The solution with multiprocessing:
import time
import pyperclip as pp
import pandas as pd
#import pymsgbox as pmb
from fuzzywuzzy import fuzz
import numpy as np
# pathos uses a better pickle to transfer more complicated objects
from pathos.multiprocessing import Pool
from functools import reduce
import sys
import os
from contextlib import closing
ratio_threshold = 70
fall_back_time = 1
db_file_path = 'database.txt'
db_separator = ';'
db_encoding = 'latin-1'
chunked_db = []
NUM_PROCESSES = os.cpu_count()
def load_db():
while True:
try:
# Read and create database
db = pd.read_csv(db_file_path, sep=db_separator, encoding=db_encoding)
db.columns = ['question', 'answer']
#db = db.drop_duplicates() # i drop it for experiment
break
except:
print("Error in load_db(). Will sleep for %i seconds..." % fall_back_time)
time.sleep(fall_back_time)
    # split database into NUM_PROCESSES roughly equal chunks
    # (fine if you have a lot of RAM; otherwise you would
    # compute row ranges in db instead, something like
    # chunk_size = len(db)//NUM_PROCESSES
    # ranges[i] = (i*chunk_size, (i+1)*chunk_size)
    # and pass the ranges into the original db to the processes)
    chunked_db = np.array_split(db, NUM_PROCESSES, axis=0)
return chunked_db
def top_answers_multiprocessed(question, chunked_db):
# on unix, python uses 'fork' mode by default
# so the process has 'copy-on-change' access to all global variables
# i.e. if process will change something in db, it will be copied to it
# with a lot of overhead
    # Unfortunately, I've heard that on Windows only 'spawn' mode with a full
    # copy of everything is used
# Process pipeline uses pickle, it's quite slow.
# so on small database you may not have benefit from multiprocessing
# If you are going to transfer big objects in or out, look
# in the direction of multiprocessing.Array
# this solution is not fully efficient,
# as pool is recreated each time
# You can create daemon processes which will monitor
# Queue for incoming questions, but it's harder to implement
def top_answers(idx):
# question is in the scope of parent function,
chunked_db[idx]['ratio'] = chunked_db[idx]['question'].apply(lambda q: fuzz.ratio(q, question))
db_sorted = chunked_db[idx].sort_values(by='ratio', ascending=False)
db_sorted = db_sorted[db_sorted['ratio'] >= ratio_threshold]
return db_sorted
with closing(Pool(processes=NUM_PROCESSES)) as pool:
        # chunked_db is a list of databases
        # they are in global scope; we send only the index because
        # otherwise the whole data chunk would be pickled
num_chunks = len(chunked_db)
# apply function top_answers across generator range(num_chunks)
res = pool.imap_unordered(top_answers, range(num_chunks))
res = list(res)
        # res is now a list of dataframes; stack them into one
        res_final = pd.concat(res)
return res_final
def write_txt(top):
result = top.apply(lambda row: "%s" % (row['answer']), axis=1).tolist()
result = '\n'.join(result)
fileHandle = open("svar.txt", "w")
fileHandle.write(result)
fileHandle.close()
pp.copy("")
def mainfunc():
global chunked_db
chunked_db = load_db()
last_db_reload = time.time()
print('db loaded')
last_clip = ""
while True:
# Get contents of clipboard
try:
new_clip = pp.paste()
except:
continue
if (new_clip != last_clip) and (len(new_clip)> 0):
print(new_clip)
last_clip = new_clip
question = new_clip.strip()
else:
continue
# Rank answer
top = top_answers_multiprocessed(question, chunked_db)
# If answer was found, show results
if len(top) > 0:
#write_txt(top)
print(top)
if __name__ == '__main__':
mainfunc()
I have a script that parses XML files using the ElementTree Path Evaluator. It works fine as it is, but it takes a long time to finish. So I tried to make a multithreaded implementation:
import fnmatch
import operator
import os
import lxml.etree
from nltk import FreqDist
from nltk.corpus import stopwords
from collections import defaultdict
from datetime import datetime
import threading
import Queue
STOPWORDS = stopwords.words('dutch')
STOPWORDS.extend(stopwords.words('english'))
DIR_NAME = 'A_DIRNAME'
PATTERN = '*.A_PATTERN'
def loadData(dir_name, pattern):
nohyphen_files = []
dir_names = []
dir_paths = []
for root, dirnames, filenames in os.walk(dir_name):
dir_names.append(dirnames)
dir_paths.append(root)
for filename in fnmatch.filter(filenames, pattern):
nohyphen_files.append(os.path.join(root, filename))
return nohyphen_files, dir_names, dir_paths
def freq(element_list, descending = True):
agglomerated = defaultdict(int)
for e in element_list:
agglomerated[e] += 1
return sorted(agglomerated.items(), key=operator.itemgetter(1), reverse=descending)
def lexDiv(amount_words):
return 1.0*len(set(amount_words))/len(amount_words)
def anotherFreq(list_types, list_words):
fd = FreqDist(list_types)
print 'top 10 most frequent types:'
for t, freq in fd.items()[:10]:
print t, freq
print '\ntop 10 most frequent words:'
agglomerated = defaultdict(int)
for w in list_words:
if not w.lower() in STOPWORDS:
agglomerated[w] += 1
sorted_dict = sorted(agglomerated.items(), key=operator.itemgetter(1),reverse=True)
print sorted_dict[:10]
def extractor(f):
print "check file: {}".format(f)
try:
# doc = lxml.etree.ElementTree(lxml.etree.XML(f))
doc = lxml.etree.ElementTree(file=f)
except lxml.etree.XMLSyntaxError, e:
print e
return
doc_evaluator = lxml.etree.XPathEvaluator(doc)
    entities = doc_evaluator('//entity/*/externalRef/@reference')
    places_dbpedia = doc_evaluator('//entity[contains(@type, "Schema:Place")]/*/externalRef/@reference')
    non_people_dbpedia = set(doc_evaluator('//entity[not(contains(@type, "Schema:Person"))]'))
    people = doc_evaluator('//entity[contains(@type, "Schema:Person")]/*/externalRef/@reference')
words = doc.xpath('text/wf[re:match(text(), "[A-Za-z-]")]/text()',\
namespaces={"re": "http://exslt.org/regular-expressions"})
unique_words = set(words)
other_tokens = doc.xpath('text/wf[re:match(text(), "[^A-Za-z-]")]/text()',\
namespaces={"re": "http://exslt.org/regular-expressions"})
    amount_of_sentences = doc_evaluator('text/wf/@sent')[-1]
    types = doc_evaluator('//term/@morphofeat')
    longest_sentence = freq(doc.xpath('text/wf[re:match(text(), "[A-Za-z-]")]/@sent',\
namespaces={"re": "http://exslt.org/regular-expressions"}))[0]
top_people = freq([e.split('/')[-1] for e in people])[:10]
top_entities = freq([e.split('/')[-1] for e in entities])[:10]
top_places = freq([e.split('/')[-1] for e in places_dbpedia])[:10]
def worker():
while 1:
job_number = q.get()
extractor(job_number)
q.task_done() #this thread is complete, move on
if __name__ =='__main__':
startTime = datetime.now()
files, dirs, path = loadData(DIR_NAME, PATTERN)
startTime = datetime.now()
q = Queue.Queue()# job queue
for f in files:
q.put(f)
for i in range(20): #make 20 workerthreads ready
worker_thread = threading.Thread(target=worker)
worker_thread.daemon = True
worker_thread.start()
q.join()
print datetime.now() - startTime
This does something, but when I time it, it isn't faster than the normal version. I think it has something to do with opening and reading the files keeping the threads from actually running in parallel. If I use a function that, instead of parsing the XML file, just sleeps for a couple of seconds and prints something, it does work and is a lot faster. What do I have to account for to have a multithreaded XML parser?
Threading in Python doesn't work the way it does in other languages. It is constrained by the Global Interpreter Lock, which makes sure only one thread is executing Python bytecode at any given time.
What you want to do is use the multiprocessing library instead.
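For example, keeping loadData() and extractor() from the question unchanged, the block under __main__ could be rewritten to use a process pool along these lines (a rough sketch, not tested against the original data):
from multiprocessing import Pool
from datetime import datetime

if __name__ == '__main__':
    startTime = datetime.now()
    files, dirs, path = loadData(DIR_NAME, PATTERN)
    pool = Pool()               # defaults to one worker per CPU core
    pool.map(extractor, files)  # each file is parsed in a separate process
    pool.close()
    pool.join()
    print(datetime.now() - startTime)
Since each file is parsed in its own process, the GIL no longer limits the XPath work; the only shared step is distributing the file names.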
You can read more about the GIL and Threading here:
https://docs.python.org/2/glossary.html#term-global-interpreter-lock
https://docs.python.org/2/library/threading.html
I am trying to compress around 95 files, each 7 GB in size, using the Python multiprocessing module:
import os;
from shutil import copyfileobj;
import bz2;
import multiprocessing as mp
import pprint
from numpy.core.test_rational import numerator
''' Input / Output Path '''
ipath = 'E:/AutoConfirm/'
opath = 'E:/compressed-autoconfirm/'
''' Number of Processes '''
num_of_proc = 6
def compressFile(fileName,chunkSize=100000000):
global ipath
print 'Started Compressing %s to %s'%(fileName,opath)
inp = open(ipath+fileName,'rb')
output = bz2.BZ2File(opath+fileName.split('/')[-1].strip('.csv')+'.bz2','wb',compresslevel=9)
copyfileobj(inp,output,chunkSize)
print 'Finished Compressing %s to %s'%(fileName,opath)
def process_worker(fileList):
for x in fileList:
compressFile(x)
def split_list(tempList):
a , reList = 0, []
global num_of_proc
for x in range(num_of_proc+1):
reList.append([tempList[a:a+len(tempList)/num_of_proc]])
a = a + len(tempList)/num_of_proc
return reList
pool = mp.Pool(processes=num_of_proc)
''' Prepare a list of all the file names '''
tempList = [x for x in os.listdir(ipath)]
''' Split the list into sub-lists
For example : if I have 90 files and I am using 6 processes
each of the process will work on 15 files each '''
iterList = split_list(tempList)
''' print iterList >> [ [filename1, filename2] , [filename3,filename4], ... ] '''
''' Pass the list consisting of sub-lists to pool '''
pool.map(process_worker,iterList)
The above code ends up creating 90 processes instead of 6. Can anyone help me identify the defect in the code?
Multiprocessing re-imports your module in each child process (that is how the spawn start method, the default on Windows, works), and since everything is at top level, the pool-creating code runs again, and again, and again.
You need to put that code in a function and call it from under the __main__ guard:
def main():
...
if __name__ == '__main__':
main()
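Applied to the script in the question, that means moving the pool setup and the map call into main(), reusing the functions already defined above (a sketch; split_list and process_worker are kept exactly as in the question):
def main():
    # create the pool only when the script is run directly,
    # not when multiprocessing re-imports the module
    pool = mp.Pool(processes=num_of_proc)
    tempList = [x for x in os.listdir(ipath)]
    iterList = split_list(tempList)
    pool.map(process_worker, iterList)
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
Note that split_list as written wraps each sub-list in an extra pair of brackets, so process_worker would receive a list of lists; dropping the inner brackets gives it the flat lists of file names it expects.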
I am filtering huge text files using multiprocessing.py. The code basically opens the text files, works on it, then closes it.
Thing is, I'd like to be able to launch it successively on multiple text files. Hence, I tried to add a loop, but for some reason it doesn't work (even though the code works on each individual file). I believe this is an issue with:
if __name__ == '__main__':
However, I am looking for something else. I tried to create Launcher and LauncherCount files like this:
LauncherCount.py:
def setLauncherCount(n):
global LauncherCount
LauncherCount = n
and,
Launcher.py:
import os
import LauncherCount
LauncherCount.setLauncherCount(0)
os.system("OrientedFilterNoLoop.py")
LauncherCount.setLauncherCount(1)
os.system("OrientedFilterNoLoop.py")
...
I import LauncherCount.py and use LauncherCount.LauncherCount as my loop index.
Of course, this doesn't work either, as it edits the variable LauncherCount.LauncherCount locally, so it won't be changed in the imported version of LauncherCount.
Is there any way to edit a global variable in an imported file? Or is there any other way to do this? What I need is to run the code multiple times, changing one value each time, and apparently without using any loop.
Thanks!
Edit: Here is my main code if necessary. Sorry for the bad style ...
import multiprocessing
import config
import time
import LauncherCount
class Filter:
""" Filtering methods """
def __init__(self):
print("launching methods")
# Return the list: [Latitude,Longitude] (elements are floating point numbers)
def LatLong(self,line):
comaCount = []
comaCount.append(line.find(','))
comaCount.append(line.find(',',comaCount[0] + 1))
comaCount.append(line.find(',',comaCount[1] + 1))
Lat = line[comaCount[0] + 1 : comaCount[1]]
Long = line[comaCount[1] + 1 : comaCount[2]]
try:
return [float(Lat) , float(Long)]
except ValueError:
return [0,0]
# Return a boolean:
# - True if the Lat/Long is within the Lat/Long rectangle defined by:
# tupleFilter = (minLat,maxLat,minLong,maxLong)
# - False if not
def LatLongFilter(self,LatLongList , tupleFilter) :
        if tupleFilter[0] <= LatLongList[0] <= tupleFilter[1] and \
           tupleFilter[2] <= LatLongList[1] <= tupleFilter[3]:
return True
else:
return False
def writeLine(self,key,line):
filterDico[key][1].write(line)
def filteringProcess(dico):
myFilter = Filter()
while True:
try:
currentLine = readFile.readline()
except ValueError:
break
if len(currentLine) ==0: # Breaks at the end of the file
break
if len(currentLine) < 35: # Deletes wrong lines (too short)
continue
LatLongList = myFilter.LatLong(currentLine)
for key in dico:
if myFilter.LatLongFilter(LatLongList,dico[key][0]):
myFilter.writeLine(key,currentLine)
###########################################################################
# Main
###########################################################################
# Open read files:
readFile = open(config.readFileList[LauncherCount.LauncherCount][1], 'r')
# Generate writing files:
pathDico = {}
filterDico = config.filterDico
# Create outputs
for key in filterDico:
    output_Name = config.readFileList[LauncherCount.LauncherCount][0][:-4] \
        + '_' + key + '.log'
pathDico[output_Name] = config.writingFolder + output_Name
filterDico[key] = [filterDico[key],open(pathDico[output_Name],'w')]
p = []
CPUCount = multiprocessing.cpu_count()
CPURange = range(CPUCount)
startingTime = time.localtime()
if __name__ == '__main__':
### Create and start processes:
for i in CPURange:
p.append(multiprocessing.Process(target = filteringProcess ,
args = (filterDico,)))
p[i].start()
### Kill processes:
while True:
if [p[i].is_alive() for i in CPURange] == [False for i in CPURange]:
readFile.close()
for key in config.filterDico:
config.filterDico[key][1].close()
print(key,"is Done!")
endTime = time.localtime()
break
print("Process started at:",startingTime)
print("And ended at:",endTime)
To process groups of files in sequence while working on files within a group in parallel:
#!/usr/bin/env python
from multiprocessing import Pool
def work_on(args):
"""Process a single file."""
i, filename = args
print("working on %s" % (filename,))
return i
def files():
"""Generate input filenames to work on."""
#NOTE: you could read the file list from a file, get it using glob.glob, etc
yield "inputfile1"
yield "inputfile2"
def process_files(pool, filenames):
"""Process filenames using pool of processes.
Wait for results.
"""
for result in pool.imap_unordered(work_on, enumerate(filenames)):
#NOTE: in general the files won't be processed in the original order
print(result)
def main():
p = Pool()
# to do "successive" multiprocessing
for filenames in [files(), ['other', 'bunch', 'of', 'files']]:
process_files(p, filenames)
if __name__=="__main__":
main()
Each process_files() call runs in sequence, after the previous one has completed, i.e., the files from different calls to process_files() are not processed in parallel.
I'd like to create a hashlib instance, update() it, then persist its state in some way. Later, I'd like to recreate the object using this state data, and continue to update() it. Finally, I'd like to get the hexdigest() of the total cumulative run of data. State persistence has to survive across multiple runs.
Example:
import hashlib
m = hashlib.sha1()
m.update('one')
m.update('two')
# somehow, persist the state of m here
#later, possibly in another process
# recreate m from the persisted state
m.update('three')
m.update('four')
print m.hexdigest()
# at this point, m.hexdigest() should be equal to hashlib.sha1('onetwothreefour').hexdigest()
EDIT:
I did not find a good way to do this with python in 2010 and ended up writing a small helper app in C to accomplish this. However, there are some great answers below that were not available or known to me at the time.
You can do it this way using ctypes; no helper app in C is needed:
rehash.py
#! /usr/bin/env python
''' A resumable implementation of SHA-256 using ctypes with the OpenSSL crypto library
Written by PM 2Ring 2014.11.13
'''
import os
from ctypes import *
SHA_LBLOCK = 16
SHA256_DIGEST_LENGTH = 32
class SHA256_CTX(Structure):
_fields_ = [
("h", c_long * 8),
("Nl", c_long),
("Nh", c_long),
("data", c_long * SHA_LBLOCK),
("num", c_uint),
("md_len", c_uint)
]
HashBuffType = c_ubyte * SHA256_DIGEST_LENGTH
#crypto = cdll.LoadLibrary("libcrypto.so")
crypto = cdll.LoadLibrary("libeay32.dll" if os.name == "nt" else "libssl.so")
class sha256(object):
digest_size = SHA256_DIGEST_LENGTH
def __init__(self, datastr=None):
self.ctx = SHA256_CTX()
crypto.SHA256_Init(byref(self.ctx))
if datastr:
self.update(datastr)
def update(self, datastr):
crypto.SHA256_Update(byref(self.ctx), datastr, c_int(len(datastr)))
#Clone the current context
def _copy_ctx(self):
ctx = SHA256_CTX()
pointer(ctx)[0] = self.ctx
return ctx
def copy(self):
other = sha256()
other.ctx = self._copy_ctx()
return other
def digest(self):
#Preserve context in case we get called before hashing is
# really finished, since SHA256_Final() clears the SHA256_CTX
ctx = self._copy_ctx()
hashbuff = HashBuffType()
crypto.SHA256_Final(hashbuff, byref(self.ctx))
self.ctx = ctx
return str(bytearray(hashbuff))
def hexdigest(self):
return self.digest().encode('hex')
#Tests
def main():
import cPickle
import hashlib
data = ("Nobody expects ", "the spammish ", "imposition!")
print "rehash\n"
shaA = sha256(''.join(data))
print shaA.hexdigest()
print repr(shaA.digest())
print "digest size =", shaA.digest_size
print
shaB = sha256()
shaB.update(data[0])
print shaB.hexdigest()
#Test pickling
sha_pickle = cPickle.dumps(shaB, -1)
print "Pickle length:", len(sha_pickle)
shaC = cPickle.loads(sha_pickle)
shaC.update(data[1])
print shaC.hexdigest()
#Test copying. Note that copy can be pickled
shaD = shaC.copy()
shaC.update(data[2])
print shaC.hexdigest()
#Verify against hashlib.sha256()
print "\nhashlib\n"
shaD = hashlib.sha256(''.join(data))
print shaD.hexdigest()
print repr(shaD.digest())
print "digest size =", shaD.digest_size
print
shaE = hashlib.sha256(data[0])
print shaE.hexdigest()
shaE.update(data[1])
print shaE.hexdigest()
#Test copying. Note that hashlib copy can NOT be pickled
shaF = shaE.copy()
shaF.update(data[2])
print shaF.hexdigest()
if __name__ == '__main__':
main()
resumable_SHA-256.py
#! /usr/bin/env python
''' Resumable SHA-256 hash for large files using the OpenSSL crypto library
The hashing process may be interrupted by Control-C (SIGINT) or SIGTERM.
When a signal is received, hashing continues until the end of the
current chunk, then the current file position, total file size, and
the sha object is saved to a file. The name of this file is formed by
appending '.hash' to the name of the file being hashed.
Just re-run the program to resume hashing. The '.hash' file will be deleted
once hashing is completed.
Written by PM 2Ring 2014.11.14
'''
import cPickle as pickle
import os
import signal
import sys
import rehash
quit = False
blocksize = 1<<16 # 64kB
blocksperchunk = 1<<8
chunksize = blocksize * blocksperchunk
def handler(signum, frame):
global quit
print "\nGot signal %d, cleaning up." % signum
quit = True
def do_hash(fname, filesize):
hashname = fname + '.hash'
if os.path.exists(hashname):
with open(hashname, 'rb') as f:
pos, fsize, sha = pickle.load(f)
if fsize != filesize:
print "Error: file size of '%s' doesn't match size recorded in '%s'" % (fname, hashname)
print "%d != %d. Aborting" % (fsize, filesize)
exit(1)
else:
pos, fsize, sha = 0, filesize, rehash.sha256()
finished = False
with open(fname, 'rb') as f:
f.seek(pos)
while not (quit or finished):
for _ in xrange(blocksperchunk):
block = f.read(blocksize)
if block == '':
finished = True
break
sha.update(block)
pos += chunksize
sys.stderr.write(" %6.2f%% of %d\r" % (100.0 * pos / fsize, fsize))
if finished or quit:
break
if quit:
with open(hashname, 'wb') as f:
pickle.dump((pos, fsize, sha), f, -1)
elif os.path.exists(hashname):
os.remove(hashname)
return (not quit), pos, sha.hexdigest()
def main():
if len(sys.argv) != 2:
print "Resumable SHA-256 hash of a file."
print "Usage:\npython %s filename\n" % sys.argv[0]
exit(1)
fname = sys.argv[1]
filesize = os.path.getsize(fname)
signal.signal(signal.SIGINT, handler)
signal.signal(signal.SIGTERM, handler)
finished, pos, hexdigest = do_hash(fname, filesize)
if finished:
print "%s %s" % (hexdigest, fname)
else:
print "sha-256 hash of '%s' incomplete" % fname
print "%s" % hexdigest
print "%d / %d bytes processed." % (pos, filesize)
if __name__ == '__main__':
main()
demo
import rehash
import pickle
sha=rehash.sha256("Hello ")
s=pickle.dumps(sha.ctx)
sha=rehash.sha256()
sha.ctx=pickle.loads(s)
sha.update("World")
print sha.hexdigest()
output
a591a6d40bf420404a011733cfb7b190d62c65bf0bcda32b57b277d9ad9f146e
Note: I would like to thank PM2Ring for his wonderful code.
hashlib.sha1 is a wrapper around a C library, so you won't be able to pickle it.
It would need to implement the __getstate__ and __setstate__ methods for Python to access its internal state.
You could use a pure Python implementation of sha1 if it is fast enough for your requirements.
I was facing this problem too, and found no existing solution, so I ended up writing a library that does something very similar to what Devesh Saini described: https://github.com/kislyuk/rehash. Example:
import pickle, rehash
hasher = rehash.sha256(b"foo")
state = pickle.dumps(hasher)
hasher2 = pickle.loads(state)
hasher2.update(b"bar")
assert hasher2.hexdigest() == rehash.sha256(b"foobar").hexdigest()
Hash algorithm for dynamic growing/streaming data?
You can easily build a wrapper object around the hash object which can transparently persist the data.
The obvious drawback is that it needs to retain the hashed data in full in order to restore the state - so depending on the data size you are dealing with, this may not suit your needs. But it should work fine up to some tens of MB.
Unfortunately, hashlib does not expose the hash algorithms as proper classes; rather, it gives factory functions that construct the hash objects, so we can't properly subclass those without poking at reserved symbols, a situation I'd rather avoid. That only means you have to build your wrapper class from scratch, which is not much of an overhead in Python anyway.
Here is some sample code that might even fill your needs:
import hashlib
from cStringIO import StringIO
class PersistentSha1(object):
def __init__(self, salt=""):
self.__setstate__(salt)
def update(self, data):
self.__data.write(data)
self.hash.update(data)
def __getattr__(self, attr):
return getattr(self.hash, attr)
def __setstate__(self, salt=""):
self.__data = StringIO()
self.__data.write(salt)
self.hash = hashlib.sha1(salt)
def __getstate__(self):
return self.data
def _get_data(self):
self.__data.seek(0)
return self.__data.read()
data = property(_get_data, __setstate__)
You can access the "data" member itself to get and set the state directly, or you can use Python's pickling functions:
>>> a = PersistentSha1()
>>> a
<__main__.PersistentSha1 object at 0xb7d10f0c>
>>> a.update("lixo")
>>> a.data
'lixo'
>>> a.hexdigest()
'6d6332a54574aeb35dcde5cf6a8774f938a65bec'
>>> import pickle
>>> b = pickle.dumps(a)
>>>
>>> c = pickle.loads(b)
>>> c.hexdigest()
'6d6332a54574aeb35dcde5cf6a8774f938a65bec'
>>> c.data
'lixo'