file.read() multiprocessing and the GIL - python

I've read that certain Python functions implemented in C, which I assume includes file.read(), can release the GIL while they're working and then get it back on completion and by doing so make use of multiple cores if they're available.
I'm using multiprocess to parallelize some code and currently I've got three processes, the parent, one child that reads data from a file, and one child that generates a checksum from the data passed to it by the first child process.
Now if I'm understanding this right, it seems that creating a new process to read the file as I'm currently doing is unnecessary and I should just call it in the main process. The question is: am I understanding this right, and will I get better performance with the read kept in the main process or in a separate one?
So given my function to read and pipe the data to be processed:
def read(file_path, pipe_out):
    with open(file_path, 'rb') as file_:
        while True:
            block = file_.read(block_size)
            if not block:
                break
            pipe_out.send(block)
    pipe_out.close()
I reckon that this will definitely make use of multiple cores, but also introduces some overhead:
multiprocess.Process(target=read, args=args).start()
But now I'm wondering if just doing this will also use multiple cores, minus the overhead:
read(*args)
Any insights anybody has as to which one would be faster and for what reason would be much appreciated!

Okay, as came out in the comments, the actual question is:
Does (C)Python create threads on its own, and if so, how can I make use of that?
Short answer: No.
But the reason why these C functions are nevertheless interesting for Python programmers is the following. By default, no two snippets of Python code running in the same interpreter can execute in parallel. This is due to the evil called the Global Interpreter Lock, aka the GIL. The GIL is held whenever the interpreter is executing Python code, which is why no two pieces of Python code can run in parallel in the same interpreter.
Nevertheless, you can still make use of multithreading in python, namely when you're doing a lot of I/O or make a lot of use of external libraries like numpy, scipy, lxml and so on, which all know about the issue and release the GIL whenever they can (i.e. whenever they do not need to interact with the python interpreter).
I hope that cleared up the issue a bit.

I think this is the main part of your question:
The question is am I understanding this right and will I get better
performance with the read kept in the main process or in a separate
one?
I assume your goal is to read and process the file as fast as possible. File reading is in any case I/O bound and not CPU bound. You cannot process data faster than you are able to read it, so file I/O clearly limits the performance of your software. You cannot increase the read data rate by using concurrent threads/processes for file reading, and CPython does not do this at a 'low level' for you either. As long as you read the file in one process or thread (even in the case of CPython with its GIL, a thread is fine), you will get as much data per unit of time as you can get from the storage device. It is also fine to do the file reading in the main thread, as long as there are no other blocking calls that would actually slow down the file reading.
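To make the "a thread is fine" point concrete, here is a minimal sketch, not the asker's actual code: a background thread reads blocks while the main thread computes the checksum. The names BLOCK_SIZE, read_blocks and checksum are made up for illustration, and SHA-256 is only an assumption, since the question does not say which checksum is used.

```python
import hashlib
import queue
import threading

BLOCK_SIZE = 1 << 20  # hypothetical block size (1 MiB), not from the question


def read_blocks(file_path, out_queue, block_size=BLOCK_SIZE):
    """Read the file block by block and hand each block to the consumer."""
    try:
        with open(file_path, 'rb') as file_:
            while True:
                block = file_.read(block_size)
                if not block:
                    break
                out_queue.put(block)
    finally:
        out_queue.put(None)  # sentinel: always tell the consumer we are done


def checksum(file_path):
    """Hash a file in the calling thread while a second thread does the reading."""
    blocks = queue.Queue(maxsize=8)  # bounded, so the reader cannot run far ahead
    reader = threading.Thread(target=read_blocks, args=(file_path, blocks))
    reader.start()
    digest = hashlib.sha256()  # assumption: any hashlib algorithm would do
    while True:
        block = blocks.get()
        if block is None:
            break
        digest.update(block)
    reader.join()
    return digest.hexdigest()
```

Calling checksum('some_file.bin') (a placeholder path) returns the hex digest. Whether this beats a plain read-then-hash loop depends on the storage device, so it is worth measuring both.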

Related

Concurrent access to one file by several unrelated processes on macOS

I need to get several processes to communicate with each other on a macOS system. These processes will be spawned several times per day at different times, and I cannot predict when they will be up at the same time (if ever). These programs are in Python or Swift.
How can I safely allow these programs to all write to the same file?
I have explored a few different options:
I thought of using sqlite3, however I couldn't find an answer in the documentation on whether it was safe to write concurrently across processes. This question is not very definitive, old, and I would ideally like to get a more authoritative answer.
I thought of using multiprocessing as it supports locks. However, as far as I could see in the documentation, you need a meta-process that spawns the children and stays up for the duration of the longest child process. I am fine having a meta-spawner process, but it feels wasteful to have a meta-process basically staying up all day long, just to resolve conflicting access?
Along the lines of this, I thought of having a process that stays up all day long, receives messages from all other processes, and is solely responsible for writing to the file. It feels a little wasteful; how worried should I be about the resource cost of having a program up all day and doing little? Are the only things to worry about memory footprint and CPU usage (as shown in Activity Monitor), or could there be other significant costs, e.g. context switching?
I have come across flock on linux, that seems to be a locking mechanism to access files, provided by the OS. This seems like a good solution, but this does not seem to exist on macOS?
Any idea to solve this requirement in a robust manner (so that I don't have to debug every other day - I know concurrency can be a pain) is most welcome!
As long as you are in control of the source code of all such processes, you could use flock. It puts an advisory lock on the file, so another writer is blocked only if it also accesses the file the same way. This is OK for you if only your processes will ever need to write to the shared file.
I've tested flock on Big Sur; it is still implemented and works fine.
You can also do it in any other common manner: create a temporary .lock file in a known location (this is what git does) and remove it after the current writer is done with the main file; use semaphores; etc.
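For the Python side, the flock approach could look roughly like the sketch below. It uses the standard fcntl module (Unix only); the function name append_line and the file path in the usage note are made up for illustration.

```python
import fcntl
import os


def append_line(path, line):
    """Append one line to path while holding an exclusive advisory lock."""
    with open(path, 'a') as f:
        # Blocks until no other cooperating process holds the lock.
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            f.write(line + '\n')
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before unlocking
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Every writer has to go through the same locking call, e.g. append_line('/tmp/shared.log', 'hello'), because an advisory lock does nothing against a process that simply opens the file and writes.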

Python, read many files and merge the results

I might be asking a very basic question but I really can't figure how to make a simple parallel application in python.
I am running my scripts on a machine with 16 cores and I would like to use all of them efficiently. I have 16 huge files to read and I would like each cpu to read one file and then merge the result.
Here I give a quick example of what I would like to do:
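from numpy import arange, loadtxt

parameter1_glob = []
parameter2_glob = []
for cpu in arange(0, 16):
    # each file holds two columns, unpacked into two arrays
    parameter1, parameter2 = loadtxt('file' + str(cpu) + '.dat', unpack=True)
    parameter1_glob.append(parameter1)
    parameter2_glob.append(parameter2)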
I think that the multiprocessing module might help but I couldn't understand how to apply it to what I want to do.
I agree with what Colin Dunklau said in his comment, this process will bottleneck on reading and writing these files, the CPU demands are minimal. Even if you had 17 dedicated drives, you wouldn't be maxing out even one core. Additionally, though I realize this is tangential to your actual question, you'll likely run into memory limitations with these "huge" files - loading 16 files into memory as arrays and then combining them into another file will almost certainly take up more memory than you have.
You may find better results looking into shell scripting this problem. In particular, GNU sort uses a memory efficient merge-sort to sort one or more files very rapidly - much faster than all but the most carefully written applications in Python or most other languages.
I would suggest avoiding any sort of multi-threading effort, it will dramatically add to the complexity, with minimal benefit. Be sure you keep as little of the file(s) in memory at a time, or you'll run out quickly. In any case, you will absolutely want to have the reading and writing running on two separate disks. The slowdown associated with reading and writing simultaneously to the same disk is tremendously painful.
Do you want to merge line by line? Sometimes coroutines are more interesting for I/O bound applications than classic multitasking. You can chain generators and coroutines for all sort of routing, merging and broadcasting. Blow your mind with this nice presentation by David Beazley.
You can use a coroutine as a sink (untested, please refer to dabeaz examples):
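import functools

def coroutine(func):
    # Prime the generator so that it is ready to receive values via send()
    @functools.wraps(func)
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)
        return cr
    return start

# A sink that just prints the lines
@coroutine
def printer():
    while True:
        line = (yield)
        print(line, end='')

sources = [
    open('file1'),
    open('file2'),
    open('file3'),
    open('file4'),
    open('file5'),
    open('file6'),
    open('file7'),
]
output = printer()
while sources:
    # Iterate over a copy, because exhausted sources are removed in the loop
    for source in list(sources):
        line = source.readline()
        if not line:  # EOF
            sources.remove(source)
            source.close()
            continue
        output.send(line)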
Assuming that the results from each file are smallish, you could do this with my package jug:
from numpy import arange, loadtxt
from jug import TaskGenerator

loadtxt = TaskGenerator(loadtxt)

@TaskGenerator
def write_parameter(oname, ps):
    with open(oname, 'w') as output:
        for p in ps:
            print(p, file=output)

parameter1_glob = []
parameter2_glob = []
for cpu in arange(0, 16):
    ps = loadtxt('file' + str(cpu) + '.dat', unpack=True)
    parameter1_glob.append(ps[0])
    parameter2_glob.append(ps[1])

write_parameter('output1.txt', parameter1_glob)
write_parameter('output2.txt', parameter2_glob)
Now, you can execute several jug execute jobs.

Multithreading / Multiprocessing in Python

I have created a simple substring search program that recursively looks through a folder and scans a large number of files. The program uses the Boyer-Moore-Horspool algorithm and is very efficient at parsing large amounts of data.
Link to program: http://pastebin.com/KqEMMMCT
What I'm trying to do now is make it even more efficient. If you look at the code, you'll notice that there are three different directories being searched. I would like to be able to create a process/thread that searches each directory concurrently, it would greatly speed up my program.
What is the best way to implement this? I have done some preliminary research, but my implementations have been unsuccessful. They seem to die after 25 minutes or so of processing (right now the single process version takes nearly 24 hours to run; it's a lot of data, and there are 648 unique keywords.)
I have done various experiments using the multiprocessing API and condensing all the various files into 3 files (one for each directory) and then mapping the files to memory via mmap(), but a: I'm not sure if this is the appropriate route to go, and b: my program kept dying at random points, and debugging was an absolute nightmare.
Yes, I have done extensive googling, but I'm getting pretty confused between pools/threads/subprocesses/multithreading/multiprocessing.
I'm not asking for you to write my program, just help me understand the thought process needed to go about implementing a solution. Thank you!
FYI: I plan to open-source the code once I get the program running. I think it's a fairly useful script, and there are limited examples of real world implementations of multiprocessing available online.
What to do depends on what's slowing down the process.
If you're reading on a single disk, and disk I/O is slowing you down, multiple threads/processes will probably just slow you down as the read head will now be jumping all over the place as different threads get control, and you'll be spending more time seeking than reading.
If you're reading on a single disk, and processing is slowing you down, then you might get a speedup from using multiprocessing to analyze the data, but you should still read from a single thread to avoid seek time delays (which are usually very long, multiple milliseconds).
If you're reading from multiple disks, and disk I/O is slowing you down, then either multiple threads or processes will probably give you a speed improvement. Threads are easier, and since most of your delay time is away from the processor, the GIL won't be in your way.
If you're reading from multiple disks, and processing is slowing you down, then you'll need to go with multiprocessing.
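As an illustration of the third case (several disks, I/O bound), here is a rough sketch that searches each directory in its own thread. It is not the program from the pastebin link: it uses the built-in `in` operator on bytes instead of Boyer-Moore-Horspool, reads each file fully into memory, and the names search_directory and search_all are invented for this example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def search_directory(root, keywords):
    """Walk root and return (path, keyword) pairs for every keyword found in a file."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, 'rb') as f:
                    data = f.read()  # assumption: each file fits in memory
            except OSError:
                continue  # unreadable file: skip it
            for keyword in keywords:
                if keyword in data:
                    hits.append((path, keyword))
    return hits


def search_all(roots, keywords):
    """Search each root directory in its own thread and combine the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(roots))) as pool:
        results = pool.map(lambda root: search_directory(root, keywords), roots)
        return [hit for hits in results for hit in hits]
```

Keywords are passed as bytes, e.g. search_all(['/data/a', '/data/b'], [b'needle']) with placeholder paths. If the directories live on one disk, or the matching itself is the bottleneck, the other cases in the list above apply instead.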
Multiprocessing is easier to understand/use than multithreading (IMO). For my reasons, I suggest reading this section of TAOUP. Basically, anything a process does, a thread can do too, only the programmer has to do everything that the OS would otherwise handle. Sharing resources (memory/files/CPU cycles)? Learn locking/mutexes/semaphores and so on for threads. The OS does this for you if you use processes.
I would suggest building 4+ processes. 1 to pull data from the hard drive, and the other three to query it for their next piece. Perhaps a fifth process to stick it all together.
This naturally fits into generators. See the genfind example, along with the gengrep example that uses it.
Also on the same site, check out the coroutines section.

same python interpreter instance running multiple scripts simultaneously?

6-7 years ago i saw an initiative for a way to run python in a tight-resources env by running the interpreter only once, while allowing several scripts to use it at the same time.
the idea was both to save the interpreter startup overhead and to save RAM.
Does something like that exist?
this question
Python: Execute multiple Scripts simultaneously from same Interpreter
doesn't address concurrency. at least the answers were about sequential running, but i need simultaneously :)
ideas?
Yes and no. Python itself uses a Global Interpreter Lock (GIL), which you can read a lot about, if you care to. To make a long story short, however, it ensures the interpreter is basically single-threaded. You can create (and run) more than one thread in your Python program, but when/if they use the Python interpreter, only one can do so at a time. If, however, you have threads running mostly code from something like SciPy or NumPy (which is native code that doesn't get interpreted) then you can run several concurrently.
Most operating systems, however, have a Copy On Write mechanism for process memory pages, which means that (as long as the code isn't modified) most of the code used by the interpreter will be shared without any extra work on your part (or the interpreter's) at all. IOW, when you run two or more copies of the interpreter, the second and subsequent will share most of the memory (at least for executable code) with the first, so resource usage will not rise (anywhere close to) linearly as you run more instances. Startup time will also be substantially reduced -- the OS has to create a new page table mapping the memory pages to the new process, but does not need to reread those pages from disk or anything like that.
Python supports threading via the thread and threading modules (one is low-level, the other one high-level). In Python 3 the low-level module is named _thread.

python queue concurrency process management

The use case is as follows :
I have a script that runs a series of
non-python executables to reduce (pulsar) data. I right now use
subprocess.Popen(..., shell=True) and then the communicate function of subprocess to
capture the standard out and standard error from the non-python executables and the captured output I log using the python logging module.
The problem is: just one core of the possible 8 get used now most of the time.
I want to spawn multiple processes, each doing a part of the data set in parallel, and I want to keep track of progress. It is a script / program to analyze data from a low-frequency radio telescope (LOFAR). The easier to install / manage and test the better.
I was about to build code to manage all this but I'm sure it must already exist in some easy library form.
The subprocess module can start multiple processes for you just fine, and keep track of them. The problem, though, is reading the output from each process without blocking any other processes. Depending on the platform there are several ways of doing this: using the select module to see which process has data to be read, setting the output pipes non-blocking using the fcntl module, or using threads to read each process's data (which subprocess.Popen.communicate itself uses on Windows, because it doesn't have the other two options). In each case the devil is in the details, though.
Something that handles all this for you is Twisted, which can spawn as many processes as you want, and can call your callbacks with the data they produce (as well as other situations.)
Maybe Celery will serve your needs.
If I understand correctly what you are doing, I might suggest a slightly different approach. Try establishing a single unit of work as a function and then layer on the parallel processing after that. For example:
1. Wrap the current functionality (calling subprocess and capturing output) into a single function. Have the function create a result object that can be returned; alternatively, the function could write out to files as you see fit.
2. Create an iterable (list, etc.) that contains an input for each chunk of data for step 1.
3. Create a multiprocessing Pool and then capitalize on its map() functionality to execute your function from step 1 for each of the items in step 2. See the python multiprocessing docs for details.
You could also use a worker/Queue model. The key, I think, is to encapsulate the current subprocess/output capture stuff into a function that does the work for a single chunk of data (whatever that is). Layering on the parallel processing piece is then quite straightforward using any of several techniques, only a couple of which were described here.
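A minimal sketch of that "one function per chunk, then map over the chunks" idea is below. Note that it swaps in a thread pool (concurrent.futures.ThreadPoolExecutor) for the multiprocessing Pool mentioned above, on the assumption that each worker only waits for an external executable, so the GIL is not a limit. The names run_chunk and run_all are made up, and the commands in the usage note are stand-ins for the real reduction executables.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_chunk(command):
    """Run one external command and return its exit code, stdout and stderr."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode, completed.stdout, completed.stderr


def run_all(commands, max_workers=8):
    """Run the commands concurrently, returning the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_chunk, commands))
```

For a quick try-out, commands = [[sys.executable, '-c', 'print(%d)' % i] for i in range(4)] followed by run_all(commands) gives one (returncode, stdout, stderr) tuple per command, which can then be passed to the logging module from the main thread.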
