Why is writing to a file faster than multiprocessing.Pipe?

I am testing the fastest way to pass data between two processes: one process writes the data, the other receives it. My script shows that writing to and reading from a file is faster than a pipe. How can this happen? Isn't memory supposed to be faster than disk?
Write to and read from a file:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from mutiprocesscomunicate import gen_data

data_size = 128 * 1024  # KB


def send_data_task(file_name):
    with open(file_name, 'wb+') as fd:
        for i in range(data_size):
            fd.write(gen_data(1))
            fd.write('\n'.encode('ascii'))
        # end EOF
        fd.write('EOF'.encode('ascii'))
    print('send done.')


def get_data_task(file_name):
    offset = 0
    fd = open(file_name, 'r+')
    i = 0
    while True:
        data = fd.read(1024)
        offset += len(data)
        if 'EOF' in data:
            fd.truncate()
            break
        if not data:
            fd.close()
            fd = None
            fd = open(file_name, 'r+')
            fd.seek(offset)
            continue
    print("recv done.")


if __name__ == '__main__':
    import multiprocessing

    pipe_out = pipe_in = 'throught_file'
    p = multiprocessing.Process(target=send_data_task, args=(pipe_out,), kwargs=())
    p1 = multiprocessing.Process(target=get_data_task, args=(pipe_in,), kwargs=())
    p.daemon = True
    p1.daemon = True
    import time
    start_time = time.time()
    p1.start()
    import time
    time.sleep(0.5)
    p.start()
    p.join()
    p1.join()
    import os
    os.sync()
    print('through file', data_size / (time.time() - start_time), 'KB/s')
    open(pipe_in, 'w+').truncate()
Use a pipe:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import multiprocessing
from mutiprocesscomunicate import gen_data

data_size = 128 * 1024  # KB


def send_data_task(pipe_out):
    for i in range(data_size):
        pipe_out.send(gen_data(1))
    # end EOF
    pipe_out.send("")
    print('send done.')


def get_data_task(pipe_in):
    while True:
        data = pipe_in.recv()
        if not data:
            break
    print("recv done.")


if __name__ == '__main__':
    pipe_out, pipe_in = multiprocessing.Pipe()
    p = multiprocessing.Process(target=send_data_task, args=(pipe_out,), kwargs=())
    p1 = multiprocessing.Process(target=get_data_task, args=(pipe_in,), kwargs=())
    p.daemon = True
    p1.daemon = True
    import time
    start_time = time.time()
    p1.start()
    p.start()
    p.join()
    p1.join()
    print('through pipe', data_size / (time.time() - start_time), 'KB/s')
The data-generation function:
def gen_data(size):
    onekb = "a" * 1024
    return (onekb * size).encode('ascii')
Result:
through file 110403.02025891568 KB/s
through pipe 75354.71358973449 KB/s
I use macOS with Python 3.
Update:
If the data is just 1 KB, the pipe is about 100 times faster than the file. But if the data is big, like 128 MB, the result is as above.

A pipe has a limited capacity, in order to match the speeds of producer and consumer (via back-pressure flow control) rather than consume an unlimited amount of memory. The particular limit on OS X, according to this Unix Stack Exchange answer, is 16 KiB. Since you are sending 128 MiB in 1 KiB messages, the writer fills that buffer and has to wait for the reader thousands of times, which means far more system calls and context switches. When working with files, the size is limited only by your disk space or quota, and without an fdatasync or similar the data never needs to reach the disk; it can be read back directly from the cache. On the other hand, when your data is small, the time to find a place to put the file dominates, leaving the pipe far faster.
When you do use fdatasync, or simply exceed the memory available for disk caching, writing to a file also slows down to match actual disk transfer speeds.
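To make the back pressure visible, here is a minimal sketch (my illustration, not part of the answer above; assumes a POSIX system) that fills a raw pipe with 1 KiB non-blocking writes until the kernel buffer is full. The point at which BlockingIOError is raised is where a blocking writer, like the one in the question, would have to stop and wait for the reader.
import os
import fcntl

r, w = os.pipe()
fcntl.fcntl(w, fcntl.F_SETFL, os.O_NONBLOCK)   # fail instead of blocking

written = 0
try:
    while True:
        written += os.write(w, b'a' * 1024)    # same 1 KiB messages as the question
except BlockingIOError:
    pass                                       # kernel pipe buffer is full

print('pipe buffer filled after about', written, 'bytes')
os.close(r)
os.close(w)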

Because quite often file data is first written into the page cache (which is in RAM) by the OS kernel.
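A rough way to see this (my sketch, not from the answer above; assumes a POSIX system) is to time the write into the page cache separately from the fsync that forces the data out to the device:
import os
import time

data = b'a' * (128 * 1024 * 1024)      # 128 MiB, as in the question

with open('page_cache_test.bin', 'wb') as fd:
    t0 = time.time()
    fd.write(data)
    fd.flush()                         # push Python's buffer to the OS
    t1 = time.time()
    os.fsync(fd.fileno())              # force the cached pages to the device
    t2 = time.time()

print('write+flush: %.3fs  fsync: %.3fs' % (t1 - t0, t2 - t1))
os.remove('page_cache_test.bin')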

Related

Python: Why is reading a file in multiple parallel processes slower than in a single process/thread?

What could be the reason that Python multiprocessing is slower than a single thread while reading binary files?
def getBinaryData(procnum, filename, pointer_from, pointer_to):
    binary_values = []
    start = time.time()
    with open(filename, 'rb') as fileobject:
        # read file byte by byte
        fileobject.seek(pointer_from)
        data = fileobject.read(1)
        while data != b'' or pointer_position < pointer_to:
            # binary_values.append(ord(data))
            data = fileobject.read(1)
            pointer_position = fileobject.tell()
    end = time.time()
    print("proc ", procnum, " finished in: ", end - start)
    return binary_values


def worker(procnum, last_proc_num, file_path, bytes_chunk, return_dict):
    """worker function"""
    print(str(procnum) + " represent!")
    if procnum == 0:
        greyscale_data = getBinaryData(procnum, file_path, 0, bytes_chunk)
    elif procnum == last_proc_num:
        greyscale_data = getBinaryData(procnum, file_path, procnum * bytes_chunk, os.stat(file_path).st_size)
    else:
        greyscale_data = getBinaryData(procnum, file_path, procnum * bytes_chunk, (procnum + 1) * bytes_chunk)
    size = get_size(len(greyscale_data))
    return_dict[procnum] = procnum


def main():
    cpu_cores = 10
    file_path = r"test_binary_file.exe"
    file_stats = os.stat(file_path)
    file_size = file_stats.st_size

    manager = multiprocessing.Manager()
    return_dict = manager.dict()
    jobs = []
    for i in range(cpu_cores):
        p = multiprocessing.Process(target=worker, args=(i, cpu_cores - 1, file_path, int(file_size / cpu_cores), return_dict))
        jobs.append(p)
        p.start()

    for proc in jobs:
        proc.join()
    print(return_dict.values())
While a single-threaded process finishes reading a 10 MB file in ~30 seconds, the multiprocess solution gets it done much more slowly.
Python log output (posted as images in the original question): one capture for 10 processes, one for 1 process.
Ruled-out issues:
IO bottleneck (NVMe SSD)
CPU/RAM bottleneck (16 cores, 4.4 GHz / 64 GB 3200 MHz RAM)
Processes are heavyweight, and it takes a lot of time to create and tear down a process, so in my opinion reading the file itself is really fast and most of the time is spent creating and terminating the processes.
To read a lot of files it is enough to use multithreading, because threads are lightweight and the GIL is released during I/O, so threads behave like truly parallel workers for I/O-bound operations.
Multiprocessing is recommended when you need to execute CPU-heavy operations.
Source of the picture (not reproduced here): https://youtu.be/kRy_UwUhBpo?t=763
The presenter says the image is from fastpython.com.
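As a rough illustration of the thread-based approach (my sketch, not from the answer; the file name and chunk count are placeholders), the same split-into-ranges idea looks like this with a thread pool and large reads instead of byte-by-byte reads:
import os
from concurrent.futures import ThreadPoolExecutor

FILE_PATH = "test_binary_file.exe"   # hypothetical input file
NUM_CHUNKS = 10


def read_chunk(start, length):
    # one seek plus one large read per chunk, instead of read(1) in a loop
    with open(FILE_PATH, "rb") as f:
        f.seek(start)
        return f.read(length)


def main():
    size = os.stat(FILE_PATH).st_size
    chunk = size // NUM_CHUNKS
    ranges = [(i * chunk, chunk if i < NUM_CHUNKS - 1 else size - i * chunk)
              for i in range(NUM_CHUNKS)]
    with ThreadPoolExecutor(max_workers=NUM_CHUNKS) as pool:
        parts = list(pool.map(lambda r: read_chunk(*r), ranges))
    print(sum(len(p) for p in parts), "bytes read")


if __name__ == "__main__":
    main()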

How should I clean up memory when using SharedMemory in Python 3.8? The close() and unlink() methods do not work

I have the following code with a consumer and a producer. The producer sends batches of 4K images to the consumer. In practice I have multiple consumers, and my intuition says that shared memory should be the most efficient way to transfer these images. The problem is that the following code seems to allocate memory without cleaning it up.
Be careful running this code if you have little RAM.
import multiprocessing
import time
import cv2 as cv
import numpy as np
from multiprocessing.context import Process
from multiprocessing import shared_memory
from multiprocessing.dummy import freeze_support

batch_size = 10


def create_shared_memory(images, sm_name):
    shm = shared_memory.SharedMemory(name=sm_name, create=True, size=images.nbytes)
    np_array = np.ndarray(images.shape, dtype=np.uint8, buffer=shm.buf)
    np_array[:] = images[:]
    return shm


def consume_images(batch_names_queue):
    while True:
        batch_name = batch_names_queue.get()
        start = time.time()
        existing_shm = shared_memory.SharedMemory(name=batch_name)
        _ = np.ndarray((batch_size, 2160, 3840, 3), dtype=np.uint8, buffer=existing_shm.buf)
        existing_shm.close()
        existing_shm.unlink()
        end = time.time()
        print("reading shared memory time " + str(end - start))


def put_images(batch_names_queue, batch_images):
    index = 0
    while True:
        index += 1
        name = str(index)
        start = time.time()
        existing_shm = create_shared_memory(batch_images, name)
        batch_names_queue.put(name)
        end = time.time()
        print("creating shared memory time " + str(end - start))


if __name__ == '__main__':
    freeze_support()

    image = cv.imread("./4k.jpg")
    batch_images = np.stack([image] * batch_size, axis=0)
    batch_names_queue = multiprocessing.Queue(maxsize=1)

    produce = Process(target=put_images, args=(batch_names_queue, batch_images,))
    produce.start()

    consume = Process(target=consume_images, args=(batch_names_queue,))
    consume.start()

    while True:
        time.sleep(100)

Asynchronous programming for calculating hashes of files

I'm trying to calculate hashes for files to check whether any changes have been made.
I have a GUI and some other observers running in the event loop, so I decided to calculate the file hashes (MD5/SHA-1, whichever is faster) asynchronously.
Synchronous code:
import hashlib
import time

chunk_size = 4 * 1024


def getHash(filename):
    md5_hash = hashlib.md5()
    with open(filename, "rb") as f:
        for byte_block in iter(lambda: f.read(chunk_size), b""):
            md5_hash.update(byte_block)
        print("getHash : " + md5_hash.hexdigest())


start = time.time()
getHash("C:\\Users\\xxx\\video1.mkv")
getHash("C:\\Users\\xxx\\video2.mkv")
getHash("C:\\Users\\xxx\\video3.mkv")
end = time.time()
print(end - start)
Output of the synchronous code: 2.4000535011291504
Asynchronous code:
import hashlib
import aiofiles
import asyncio
import time

chunk_size = 4 * 1024


async def get_hash_async(file_path: str):
    async with aiofiles.open(file_path, "rb") as fd:
        md5_hash = hashlib.md5()
        while True:
            chunk = await fd.read(chunk_size)
            if not chunk:
                break
            md5_hash.update(chunk)
        print("get_hash_async : " + md5_hash.hexdigest())


async def check():
    start = time.time()
    t1 = get_hash_async("C:\\Users\\xxx\\video1.mkv")
    t2 = get_hash_async("C:\\Users\\xxx\\video2.mkv")
    t3 = get_hash_async("C:\\Users\\xxx\\video3.mkv")
    await asyncio.gather(t1, t2, t3)
    end = time.time()
    print(end - start)


loop = asyncio.get_event_loop()
loop.run_until_complete(check())
Output of the asynchronous code: 27.957366943359375
Am I doing it right? Or are there any changes I should make to improve the performance of the code?
Thanks in advance.
In the synchronous case you read the files sequentially, and reading a file in chunks sequentially is fast.
In the asynchronous case your event loop blocks while it is calculating the hash, which is why only one hash can be calculated at a time. See: What do the terms "CPU bound" and "I/O bound" mean?
If you want to increase the calculation speed, you need to use threads; they can run in parallel here because hashlib releases the GIL while hashing large chunks. Increasing CHUNK_SIZE should also help.
import hashlib
import os
import time
from pathlib import Path
from multiprocessing.pool import ThreadPool

CHUNK_SIZE = 1024 * 1024


def get_hash(filename):
    md5_hash = hashlib.md5()
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            md5_hash.update(chunk)
    return md5_hash


if __name__ == '__main__':
    directory = Path("your_dir")
    files = [path for path in directory.iterdir() if path.is_file()]
    number_of_workers = os.cpu_count()

    start = time.time()
    with ThreadPool(number_of_workers) as pool:
        files_hash = pool.map(get_hash, files)
    end = time.time()

    print(end - start)
In the case of calculating the hash for only one file: aiofiles uses a thread for each file, and the OS needs time to create a thread.
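If the hashing has to coexist with the asker's GUI event loop, one option (my sketch, not from the answer above; asyncio.to_thread requires Python 3.9+, and the file names are placeholders) is to keep the hashing synchronous and push each file onto a worker thread:
import asyncio
import hashlib

CHUNK_SIZE = 1024 * 1024


def blocking_hash(path):
    # plain synchronous hashing; runs in a worker thread, not on the event loop
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            md5.update(chunk)
    return md5.hexdigest()


async def main():
    paths = ["video1.mkv", "video2.mkv", "video3.mkv"]   # hypothetical files
    digests = await asyncio.gather(*(asyncio.to_thread(blocking_hash, p) for p in paths))
    print(digests)


asyncio.run(main())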

How to spin up a new thread when data is available in a queue to process

I have a function that streams a zip file into a byte buffer; from that buffer I build chunks of 5000 lines each, and I am now trying to write those chunks back to an S3 bucket as separate files. Since I am using AWS Lambda, I cannot let a single thread handle the whole workflow, because Lambda times out after 5 minutes. Coming from a Java background, where threads are fairly simple to implement, I am confused about how to run a pool of threads in Python to handle the upload-to-S3 part of my process. Here is my code:
import io
import zipfile
import boto3
import sys
import multiprocessing
# from multiprocessing.dummy import Pool as ThreadPool
import time

s3_client = boto3.client('s3')
s3 = boto3.resource('s3', 'us-east-1')


def stream_zip_file():
    # pool = ThreadPool(threads)
    start_time_main = time.time()
    start_time_stream = time.time()
    obj = s3.Object(
        bucket_name='monkey-business-dev-data',
        key='sample-files/daily/banana/large/banana.zip'
    )
    end_time_stream = time.time()
    # process_queue = multiprocessing.Queue()
    buffer = io.BytesIO(obj.get()["Body"].read())
    output = io.BytesIO()
    print(buffer)
    z = zipfile.ZipFile(buffer)
    foo2 = z.open(z.infolist()[0])
    print(sys.getsizeof(foo2))

    line_counter = 0
    file_clounter = 0
    for line in foo2:
        line_counter += 1
        output.write(line)
        if line_counter >= 5000:
            file_clounter += 1
            line_counter = 0
            # pool.map(upload_to_s3, (output, file_clounter))
            # upload_to_s3(output, file_clounter)
            # process_queue.put(output)
            output.close()
            output = io.BytesIO()
    if line_counter > 0:
        # process_queue.put(output)
        # upload_to_s3(output, file_clounter)
        # pool.map(upload_to_s3, args=(output, file_clounter))
        output.close()

    print('Total Files: {}'.format(file_clounter))
    print('Total Lines: {}'.format(line_counter))

    output.seek(0)
    start_time_upload = time.time()
    end_time_upload = time.time()
    output.close()

    z.close()
    end_time_main = time.time()
    print('''
        main: {}
        stream: {}
        upload: {}
    '''.format((end_time_main - start_time_main), (end_time_stream - start_time_stream), (end_time_upload - start_time_upload)))


def upload_to_s3(output, file_name):
    output.seek(0)
    s3_client.put_object(
        Bucket='monkey-business-dev-data', Key='sample-files/daily/banana/large/{}.txt'.format(file_name),
        ServerSideEncryption='AES256',
        Body=output,
        ACL='bucket-owner-full-control'
    )


# consumer_process = multiprocessing.Process(target=data_consumer, args=(process_queue))
# consumer_process.start()
#
#
# def data_consumer(queue):
#     while queue.empty() is False:


if __name__ == '__main__':
    stream_zip_file()
Now, I have tried several ways to do this. My specific requirement is a thread pool of 10 threads that always poll a queue: when a chunk is available on the queue, a thread picks it up and starts uploading it, while the queue keeps being polled for new chunks; if another chunk becomes available and the first thread is still busy with its S3 upload, a new thread automatically starts and uploads that file to S3, and so on. I have checked many answers here and on Google, but nothing seems to work or make sense to my feeble mind.
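For what it's worth, here is a generic sketch of the queue-plus-worker-threads pattern described above (my illustration, not an answer from the thread; the chunk data is fake and the call to the asker's upload_to_s3 is left as a comment):
import threading
import queue

NUM_WORKERS = 10
chunk_queue = queue.Queue()


def uploader(worker_id):
    while True:
        item = chunk_queue.get()
        if item is None:                      # sentinel: no more chunks
            chunk_queue.task_done()
            break
        chunk_bytes, chunk_no = item
        # upload_to_s3(io.BytesIO(chunk_bytes), chunk_no)  # the asker's upload function
        print("worker", worker_id, "uploaded chunk", chunk_no, len(chunk_bytes), "bytes")
        chunk_queue.task_done()


workers = [threading.Thread(target=uploader, args=(i,)) for i in range(NUM_WORKERS)]
for w in workers:
    w.start()

# the producer loop would put real 5000-line chunks here
for chunk_no in range(25):
    chunk_queue.put((b"x" * 5000, chunk_no))

for _ in workers:
    chunk_queue.put(None)                     # one sentinel per worker
chunk_queue.join()
for w in workers:
    w.join()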

Dynamic refresh printing of multiprocessing or multithreading in Python

I have implemented a multiprocessing downloader.
How can I print status bars (completion rate, download speed) that refresh automatically in different parts of the terminal?
Like this:
499712 [6.79%] 68k/s // keep refreshing
122712 [16.79%] 42k/s // different process/thread
99712 [56.32%] 10k/s
Code:
download(...)
    ...
    f = open(tmp_file_path, 'wb')
    print "Downloading: %s Bytes: %s" % (self.file_name, self.file_size)

    file_size_dl = 0
    block_sz = 8192
    start_time = time.time()
    while True:
        buffer = self.opening.read(block_sz)
        if not buffer:
            break
        file_size_dl += len(buffer)
        f.write(buffer)
        end_time = time.time()
        cost_time = end_time - start_time
        if cost_time == 0:
            cost_time = 1
        status = "\r%10d [%3.2f%%] %3dk/s" % (file_size_dl,
                                              file_size_dl * 100. / self.file_size,
                                              file_size_dl * 100. / 1024 / 1024 / cost_time)
        print status,
        sys.stdout.flush()
    f.close()
DownloadProcess inherits from the Process class and triggers the download method. I use a queue to store the URLs. Here is how the processes are started:
...
for i in range(3):
    t = DownloadProcess(queue)
    t.start()

for url in urls:
    queue.put(url)

queue.join()
Below is a demo that implements both multiprocessing and multithreading. To try one or the other, just uncomment the import lines at the top of the code.
If you have a progress bar on a single line, you can use your technique of printing '\r' to move the cursor back to the start of the line. But if you want multi-line progress bars, you have to get a little fancier; I simply cleared the screen each time I wanted to print the progress bars. Check out the article on console output on Unix in Python, which helped me a great deal in producing the code below and shows both techniques. You can also give the curses library, part of the Python standard library, a shot. The question "Multiline progress bars" asks a similar thing.
The main thread/process spawns the children that do the work and communicate their progress back to the main thread using a queue. I highly recommend using queues for inter-process/thread communication. The main thread then displays the progress and waits for all children to finish before exiting.
Code:
import time, random, sys, collections
from multiprocessing import Process as Task, Queue
#from threading import Thread as Task
#from Queue import Queue


def download(status, filename):
    count = random.randint(5, 30)
    for i in range(count):
        status.put([filename, (i + 1.0) / count])
        time.sleep(0.1)


def print_progress(progress):
    sys.stdout.write('\033[2J\033[H')  # clear screen
    for filename, percent in progress.items():
        bar = ('=' * int(percent * 20)).ljust(20)
        percent = int(percent * 100)
        sys.stdout.write("%s [%s] %s%%\n" % (filename, bar, percent))
    sys.stdout.flush()


def main():
    status = Queue()
    progress = collections.OrderedDict()
    workers = []

    for filename in ['test1.txt', 'test2.txt', 'test3.txt']:
        child = Task(target=download, args=(status, filename))
        child.start()
        workers.append(child)
        progress[filename] = 0.0

    while any(i.is_alive() for i in workers):
        time.sleep(0.1)
        while not status.empty():
            filename, percent = status.get()
            progress[filename] = percent
        print_progress(progress)

    print 'all downloads complete'


main()
