Asynchronous programming for calculating hashes of files - python

I'm trying to calculate hash for files to check if any changes are made.
i have Gui and some other observers running in the event loop.
So, i decided to calculate hash of files [md5/Sha1 which ever is faster] asynchronously.
Synchronous code :
import hashlib
import time
chunk_size = 4 * 1024
def getHash(filename):
md5_hash = hashlib.md5()
with open(filename, "rb") as f:
for byte_block in iter(lambda: f.read(chunk_size), b""):
md5_hash.update(byte_block)
print("getHash : " + md5_hash.hexdigest())
start = time.time()
getHash("C:\\Users\\xxx\\video1.mkv")
getHash("C:\\Users\\xxx\\video2.mkv")
getHash("C:\\Users\\xxx\\video3.mkv")
end = time.time()
print(end - start)
Output of synchronous code : 2.4000535011291504
Asynchronous code :
import hashlib
import aiofiles
import asyncio
import time
chunk_size = 4 * 1024
async def get_hash_async(file_path: str):
async with aiofiles.open(file_path, "rb") as fd:
md5_hash = hashlib.md5()
while True:
chunk = await fd.read(chunk_size)
if not chunk:
break
md5_hash.update(chunk)
print("get_hash_async : " + md5_hash.hexdigest())
async def check():
start = time.time()
t1 = get_hash_async("C:\\Users\\xxx\\video1.mkv")
t2 = get_hash_async("C:\\Users\\xxx\\video2.mkv")
t3 = get_hash_async("C:\\Users\\xxx\\video3.mkv")
await asyncio.gather(t1,t2,t3)
end = time.time()
print(end - start)
loop = asyncio.get_event_loop()
loop.run_until_complete(check())
Output of asynchronous code : 27.957366943359375
am i doing it right? or, are there any changes to be made to improve the performance of the code?
Thanks in advance.

In sync case, you read files sequentially. It's faster to read a file by chunks sequentially.
In async case, your event loop blocks when it's calculating the hash. That's why only one hash can be calculated at the same time. What do the terms “CPU bound” and “I/O bound” mean?
If you want to increase the calculating speed, you need to use threads. Threads can be executed on CPU in parallel. Increasing CHUNK_SIZE should also help.
import hashlib
import os
import time
from pathlib import Path
from multiprocessing.pool import ThreadPool
CHUNK_SIZE = 1024 * 1024
def get_hash(filename):
md5_hash = hashlib.md5()
with open(filename, "rb") as f:
while True:
chunk = f.read(CHUNK_SIZE)
if not chunk:
break
md5_hash.update(chunk)
return md5_hash
if __name__ == '__main__':
directory = Path("your_dir")
files = [path for path in directory.iterdir() if path.is_file()]
number_of_workers = os.cpu_count()
start = time.time()
with ThreadPool(number_of_workers) as pool:
files_hash = pool.map(get_hash, files)
end = time.time()
print(end - start)
In case of calculating hash for only 1 file: aiofiles uses a thread for each file. OS needs time to create a thread.

Related

Python Multi Processing printing statement multiple times

I've written this code that downloads a file from the internet and saves it to my computer.
To make it more efficient, I added MultiProcessing to my code to be able to download multiple files at the same time and it works, However, It keeps printing the progressbar I added again and again.
What I want is for the progress bars to display once and keep updating, like they would before the Multi Processing functionality is added. I've added my code below to reproduce.
from multiprocessing import Process
from alive_progress import alive_bar
import requests
import time
import os
def download(url):
curr_dir = os.getcwd()
x = requests.head(url)
y = requests.head(x.headers['Location'])
file_size = int(int(y.headers['content-length']) / 1024)
chunk_size = 1024
def compute():
response = requests.get(url, stream=True)
with open(curr_dir + '\\' + str(time.time()) + '.mp4', 'wb') as f:
for chunk in response.iter_content(chunk_size=chunk_size):
f.write(chunk)
yield 1024
with alive_bar(file_size, bar='classic2', spinner='classic') as bar:
for i in compute():
bar()
print("Downloaded!")
if __name__ == '__main__':
processess = []
num_processess = 2
for i in num_processess:
process = Process(target=download, args=(links[i],))
processess.append(process)
for process in processess:
process.start()
for process in processess:
process.join()
Alive-progress doesn't support showing and updating multiple progress bars. You have to use another library, such as the tqdm.
The following is an example of using the tqdm for your scenario. The key point is to call the tqdm.set_lock() to specify a synchronization mechanism for inter-process interaction and control positions of progress bars via the position argument of tqdm().
import multiprocessing
import tqdm
def download(url, id, tqdm_lock):
...
tqdm.tqdm.set_lock(tqdm_lock)
with tqdm.tqdm(total=file_size, position=id) as bar:
for i in compute():
bar.update(1)
bar.clear()
...
if __name__ == '__main__':
tqdm_lock = multiprocessing.RLock()
processess = []
num_processess = 2
links = [...]
for i in num_processess:
process = Process(target=download, args=(links[i], i, tqdm_lock))
processess.append(process)
for process in processess:
process.start()
for process in processess:
process.join()
Update 2
If you want multiple progress bars, then I would use package tqdm.
This is how I would approach it:
First find out for each URL how many CHUNK_SIZE chunks there are. CHUNK_SIZE is set at 1024, but consider increasing this for large files. A potential issue is that the 'content-length' header key is not always present. In this case, the URL is considered to consist of a single chunk and the progress bar created will be updated only once when the entire file has been downloaded..
Then each submitted task creates a progress bar whose size is the number of chunks it retrieved in step 1 and designated for a specific position based on its task number. Then the chunks are retrieved and the progress bar is updated. The logic is predicated on the file being retrieved never varying in size when the content-length key is present in the fetched header. That is, the size of the file does not change between the head and get requests being issued so that the progress bar size set from the head command will match the actual number of chunks read when the download is done.
In the code below I have commented out specific code pertaining to the writing of downloaded files to disk and have gotten rid of the compute generator function, which now seems to be unnecessary. I have also added a delay between successive fetching of chunks so that the progress bar does not progress too fast:
import requests
from tqdm import tqdm
CHUNK_SIZE = 1024
def get_number_of_chunks(url):
r = requests.head(url, allow_redirects=True)
headers = r.headers
if 'content-length' in headers:
n_chunks, remainder = divmod(int(headers['content-length']), CHUNK_SIZE)
if remainder:
n_chunks += 1
else:
n_chunks = 1
return n_chunks
def download(task_number, url):
n_chunks = get_number_of_chunks(url)
response = requests.get(url, stream=True)
#with open(str(time.time()) + '.mp4', 'wb') as f:
if True:
with tqdm(total=n_chunks, position=task_number) as bar:
for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
#f.write(chunk)
if n_chunks != 1:
bar.update(1)
# For demo purposes:
import time
time.sleep(.1)
if n_chunks == 1:
bar.update(1)
if __name__ == '__main__':
from multiprocessing.pool import ThreadPool
links = [
'http://localhost/friends/images/nav.png',
'http://localhost/friends/images/race.jpg',
]
n_writers = len(links)
pool = ThreadPool(n_writers)
pool.starmap(download, enumerate(links))
pool.close()
pool.join()
Multiprocessing Version
If you must use multiprocessing, then thanks to relent95, who showed the way:
import requests
from tqdm import tqdm
CHUNK_SIZE = 1024
def init_pool_processes(lock):
"""
Note: The lock only needs to be set once for each pool process.
"""
tqdm.set_lock(lock)
def get_number_of_chunks(url):
r = requests.head(url, allow_redirects=True)
headers = r.headers
if 'content-length' in headers:
n_chunks, remainder = divmod(int(headers['content-length']), CHUNK_SIZE)
if remainder:
n_chunks += 1
else:
n_chunks = 1
return n_chunks
def download(task_number, url):
n_chunks = get_number_of_chunks(url)
response = requests.get(url, stream=True)
#with open(str(time.time()) + '.mp4', 'wb') as f:
if True:
with tqdm(total=n_chunks, position=task_number) as bar:
for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
#f.write(chunk)
if n_chunks != 1:
bar.update(1)
# For demo purposes:
import time
time.sleep(.1)
if n_chunks == 1:
bar.update(1)
if __name__ == '__main__':
from multiprocessing import Pool, Lock
links = [
'http://localhost/friends/images/nav.png',
'http://localhost/friends/images/race.jpg',
]
n_writers = len(links)
pool = Pool(n_writers, initializer=init_pool_processes, initargs=(Lock(),))
pool.starmap(download, enumerate(links))
pool.close()
pool.join()

Threading takes longer than non threaded program

I'm working on the below code. The task is to generate 1,000,000 random numbers and save these to a .txt-file. The code I'm using without any threading needs about 2.5 seconds to run, but the threading version needs anywhere between 120 - 280 seconds. I have no clue where and why it goes wrong. Are the locks inhibiting each other?
Simple version
import random
import timeit
import os
start = timeit.default_timer()
f = open('file2.txt', 'a')
for i in range(0,1000000):
f.write(str(random.randint(0,32576)))
f.write('\n')
f.close()
end = timeit.default_timer()
Threading version
import threading
import queue
import random
import timeit
import os
tasks = queue.Queue()
start = timeit.default_timer()
output = open('file2.txt', 'w')
output_lock = threading.Lock()
def worker(thread_number):
while not tasks.empty():
tasks.get()
f = open('file2.txt','a')
with output_lock: # block until lock is available
f.write(str(random.randint(0,32767)))
f.write('\n')
f.close()
tasks.task_done()
for i in range(1000000):
tasks.put(i)
for thread in range(8):
threading.Thread(target=worker, args=(thread,)).start()
print('waiting')
tasks.join()

How should I clean up memory using SharedMemory in python 3.8? Close() and unlink() methods do not work

I have the following code where I have a consumer and a producer. The producer sends a batch of 4k images to the consumer. In practice a have multiple consumers and my intuition says that using shared memory must be the most efficient way to transfer these images. The problem is that the following code seems to allocate memory without cleaning it up.
I warn you to run this code with caution if you have a small RAM memory.
import multiprocessing
import time
import cv2 as cv
import numpy as np
from multiprocessing.context import Process
from multiprocessing import shared_memory
from multiprocessing.dummy import freeze_support
batch_size = 10
def create_shared_memory(images, sm_name):
shm = shared_memory.SharedMemory(name=sm_name, create=True, size=images.nbytes)
np_array = np.ndarray(images.shape, dtype=np.uint8, buffer=shm.buf)
np_array[:] = images[:]
return shm
def consume_images(batch_names_queue):
while True:
batch_name = batch_names_queue.get()
start = time.time()
existing_shm = shared_memory.SharedMemory(name=batch_name)
_ = np.ndarray((batch_size, 2160, 3840, 3), dtype=np.uint8, buffer=existing_shm.buf)
existing_shm.close()
existing_shm.unlink()
end = time.time()
print("reading shared memory time " + str(end - start))
def put_images(batch_names_queue, batch_images):
index = 0
while True:
index += 1
name = str(index)
start = time.time()
existing_shm = create_shared_memory(batch_images, name)
batch_names_queue.put(name)
end = time.time()
print("creating shared memory time " + str(end - start))
if __name__ == '__main__':
freeze_support()
image = cv.imread("./4k.jpg")
batch_images = np.stack([image] * batch_size, axis=0)
batch_names_queue = multiprocessing.Queue(maxsize=1)
produce = Process(target=put_images, args=(batch_names_queue, batch_images,))
produce.start()
consume = Process(target=consume_images, args=(batch_names_queue,))
consume.start()
while True:
time.sleep(100)

How to spin a new thread if data is available in a queue to process.

I have a function that zip streams data into a bytebuffer, from that bytebuffer I create 5000lines/chunks, now I am trying to write these chunks back to s3 bucket in separate files, since I am using AWS Lambda I have cannot let single thread handle all the workflow as there 5 minute constraint after which AWS Lambda times out, coming from Java background where threads are pretty simple to implement but in python I am getting confused how to execute pool of thread to take care of uploading file to s3 part of my process, here is my code:
import io
import zipfile
import boto3
import sys
import multiprocessing
# from multiprocessing.dummy import Pool as ThreadPool
import time
s3_client = boto3.client('s3')
s3 = boto3.resource('s3', 'us-east-1')
def stream_zip_file():
# pool = ThreadPool(threads)
start_time_main = time.time()
start_time_stream = time.time()
obj = s3.Object(
bucket_name='monkey-business-dev-data',
key='sample-files/daily/banana/large/banana.zip'
)
end_time_stream = time.time()
# process_queue = multiprocessing.Queue()
buffer = io.BytesIO(obj.get()["Body"].read())
output = io.BytesIO()
print (buffer)
z = zipfile.ZipFile(buffer)
foo2 = z.open(z.infolist()[0])
print(sys.getsizeof(foo2))
line_counter = 0
file_clounter = 0
for line in foo2:
line_counter += 1
output.write(line)
if line_counter >= 5000:
file_clounter += 1
line_counter = 0
# pool.map(upload_to_s3, (output, file_clounter))
# upload_to_s3(output, file_clounter)
# process_queue.put(output)
output.close()
output = io.BytesIO()
if line_counter > 0:
# process_queue.put(output)
# upload_to_s3(output, file_clounter)
# pool.map(upload_to_s3, args =(output, file_clounter))
output.close()
print('Total Files: {}'.format(file_clounter))
print('Total Lines: {}'.format(line_counter))
output.seek(0)
start_time_upload = time.time()
end_time_upload = time.time()
output.close()
z.close()
end_time_main = time.time()
print('''
main: {}
stream: {}
upload: {}
'''.format((end_time_main-start_time_main),(end_time_stream-start_time_stream),(end_time_upload-start_time_upload)))
def upload_to_s3(output, file_name):
output.seek(0)
s3_client.put_object(
Bucket='monkey-business-dev-data', Key='sample-files/daily/banana/large/{}.txt'.format(file_name),
ServerSideEncryption='AES256',
Body=output,
ACL='bucket-owner-full-control'
)
# consumer_process = multiprocessing.Process(target=data_consumer, args=(process_queue))
# consumer_process.start()
#
#
# def data_consumer(queue):
# while queue.empty() is False:
if __name__ == '__main__':
stream_zip_file()
Now I have tried several ways to do it, my specific requirement is to have threadpool with size of 10 threads and these threads would always pool a queue, if chunk is available to upload on queue thread would execute and start uploading the chunk meanwhile one thread would always continuously pool the queue for new chunk and if chunk gets available a new thread (if thread 1 is still busy in s3 upload) will automatically start and upload the file to s3 and so on. I have checked many answers here and on google but nothing seems to work or make sense to my feeble mind.

Why write to a file is faster than mutiprocessing.Pipe?

i am test the fastest way between two process. i got two process, one write data, one receive data. my script show write and read from a file is fater than pipe. How can this happen? memory is faster than disk??
write and read from file:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from mutiprocesscomunicate import gen_data
data_size = 128 * 1024 # KB
def send_data_task(file_name):
with open(file_name, 'wb+') as fd:
for i in range(data_size):
fd.write(gen_data(1))
fd.write('\n'.encode('ascii'))
# end EOF
fd.write('EOF'.encode('ascii'))
print('send done.')
def get_data_task(file_name):
offset = 0
fd = open(file_name, 'r+')
i = 0
while True:
data = fd.read(1024)
offset += len(data)
if 'EOF' in data:
fd.truncate()
break
if not data:
fd.close()
fd = None
fd = open(file_name, 'r+')
fd.seek(offset)
continue
print("recv done.")
if __name__ == '__main__':
import multiprocessing
pipe_out = pipe_in = 'throught_file'
p = multiprocessing.Process(target=send_data_task, args=(pipe_out,), kwargs=())
p1 = multiprocessing.Process(target=get_data_task, args=(pipe_in,), kwargs=())
p.daemon = True
p1.daemon = True
import time
start_time = time.time()
p1.start()
import time
time.sleep(0.5)
p.start()
p.join()
p1.join()
import os
os.sync()
print('through file', data_size / (time.time() - start_time), 'KB/s')
open(pipe_in, 'w+').truncate()
use pipe
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import multiprocessing
from mutiprocesscomunicate import gen_data
data_size = 128 * 1024 # KB
def send_data_task(pipe_out):
for i in range(data_size):
pipe_out.send(gen_data(1))
# end EOF
pipe_out.send("")
print('send done.')
def get_data_task(pipe_in):
while True:
data = pipe_in.recv()
if not data:
break
print("recv done.")
if __name__ == '__main__':
pipe_out, pipe_in = multiprocessing.Pipe()
p = multiprocessing.Process(target=send_data_task, args=(pipe_out,), kwargs=())
p1 = multiprocessing.Process(target=get_data_task, args=(pipe_in,), kwargs=())
p.daemon = True
p1.daemon = True
import time
start_time = time.time()
p1.start()
p.start()
p.join()
p1.join()
print('through pipe', data_size / (time.time() - start_time), 'KB/s')
create data function:
def gen_data(size):
onekb = "a" * 1024
return (onekb * size).encode('ascii')
result:
through file 110403.02025891568 KB/s
through pipe 75354.71358973449 KB/s
i use Mac os with python3.
update
if data is just 1kb, pipe is 100 faster than file. but if date if big, like 128MB result is above.
A pipe has a limited capacity, in order to match speeds of producer and consumer (via back pressure flow control) rather than consume an unlimited amount of memory. The particular limit on OS X, according to this Unix stack exchange answer, is 16KiB. As you're writing 128KiB, this means 8 times as many system calls (and context switches), at least. When working with files, the size is limited by your disk space or quota only, and without a fdatasync or similar, it won't need to make it to disk; it can be read again directly from cache. On the other hand, when your data is small, the time to find a place to put the file dominates leaving the pipe far faster.
When you do use fdatasync, or just exceed the available memory for disk caching, writing to disk also slows down to match actual disk transfer speeds.
Because quite often file data is first written into the page cache (which is in RAM) by the OS kernel.

Categories

Resources