Asynchronous file IO in multi-threaded application with variable index - python

In one of my programs, I have the following method:
def set_part_content(self, part_no, block_no, data):
    with open(self.file_path, "rb+") as f:
        f.seek(part_no * self.constant1 + block_no * self.constant2)
        f.write(data)
I did it this way for the following reasons:
I have to write at different indexes (which is why the f.seek is there),
and this function has to be thread safe (which I get thanks to the with statement).
My issue is that this function is called approximately 10k to 100k times, and it is really, really slow (it represents half of the execution time of one of my most critical sets of functionality) because of the opening/closing overhead.
Because of the f.seek, I can't open the file once in the __init__ function and operate on that handle (if 2 threads use the function at the same time, one of them ends up writing at the wrong index, which is critical).
Is there any module or approach that could speed this function up?

I am not that familiar with Python, so I am not sure whether the with statement makes this function thread safe or not.
If it does, you won't have 2 threads using the function at the same time.
If it doesn't, your function is not thread safe and you need a lock to ensure thread safety; you may look into implementing file locks using Python's with statement. Another option is to make this function asynchronous and let a single thread handle all the writing.
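If the lock route is taken, one way to apply it while also avoiding the open/close cost is to open the file once and guard the seek-and-write pair with an explicit lock. A minimal sketch, assuming the file_path / constant1 / constant2 values from the question (the PartWriter class name is made up):

import threading

class PartWriter(object):  # hypothetical wrapper; adapt to your own class
    def __init__(self, file_path, constant1, constant2):
        self.constant1 = constant1
        self.constant2 = constant2
        self._f = open(file_path, "rb+")   # open once instead of on every call
        self._lock = threading.Lock()

    def set_part_content(self, part_no, block_no, data):
        # the lock makes the seek+write pair atomic with respect to other threads,
        # so two callers can no longer interleave and write at the wrong offset
        with self._lock:
            self._f.seek(part_no * self.constant1 + block_no * self.constant2)
            self._f.write(data)

    def close(self):
        self._f.close()

With the handle kept open, each call only pays for the seek and the write.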

Related

Why are `pathlib.Path("xxx/yyy").unlink / mkdir / rmdir` not synchronous operations?

I am using Python's pathlib.Path module together with PyTorch's DistributedDataParallel.
When running multiple processes with DistributedDataParallel, I delete or create a file with Path("xxx/yyy").rmdir / .unlink / .mkdir only in process 0, which is local_rank: 0.
Then a weird thing happened: it seems that the Path("xxx/yyy").rmdir / .unlink / .mkdir operations are not synchronous, meaning that the process in rank 0 carries on without waiting for the file operation to finish. So if there is a file check after that operation, for example Path("xxx/yyy").parent.iterdir, the results in rank 0 and rank 1 (or other ranks) are not equal: rank 0 finds (or does not find) files that the other ranks do not (or do).
The problem has been fixed by adding a synchronous lock, the code is simple and as follows:
from time import sleep

def wait_to_success(fn, no=False):
    while True:
        sleep(0.01)
        if (no and not fn()) or (not no and fn()):
            break
And this function is used as follows:
remove_file_operation(filepath)
wait_to_success(filepath.exists, no=True)
or
create_file_operation(filepath)
wait_to_success(filepath.exists, no=False)
Note that the torch.distributed.barrier() function does not fix this problem.
So I am very confused: why does the pathlib.Path operation not wait until the file operation has finished?

Why does implementing threading in Python cause I/O to become really slow?

I have this function that saves a large list of dictionaries into files of 100 items each. This worked flawlessly in a normal environment, but after changing nothing except running it using threading, I experienced rather significant slow-downs. To my knowledge, simply adding threading or multiprocessing shouldn't cause slowdowns in I/O. If I am missing something trivial, please let me know; if I am not, how can I make this not run so slowly?
def savePlayerQueueToDisk(saveCopy):
    print("="*10)
    print("Application exit signal received. Caching loaded players...")
    import pickle
    i = 0
    for l in chunks(saveCopy, 100):
        with open(CACHED_PLAYERS_DIR / f"{str(i)}", 'wb') as filehandle:
            pickle.dump(saveCopy, filehandle)
        i += 1
        print(f"saved chunk {i}")
    import sys
    print("="*10)
    sys.exit()
def mainFunction():
    # Calls a main program, where on KeyboardException, it calls the savePlayerQueueToDisk function.
    t2 = threading.Thread(target=main, kwargs={'ignore_exceptions': False})
    t2.start()
Edit: After some more testing, by using multiprocessing instead of threading, running only one of the workers in a second process and keeping mainFunction on the main thread, I experienced no slowdowns. Why is this the case?
Edit: After more testing and debugging, I found that the issue is not actually tied to multiprocessing or I/O at all. There is actually a logic error on line 8 of my savePlayerQueueToDisk() function: it reads pickle.dump(saveCopy, filehandle) when it should instead be pickle.dump(l, filehandle). The I/O would just get slower and slower the more I ran the function, because I would save the entire list into over 100 files, and when all those files were loaded back in, I would save 100 copies of the data again into over 100 files. Loading and saving these, obviously, would just get out of hand.
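For reference, a corrected sketch of that loop with the fix applied (chunks() and CACHED_PLAYERS_DIR are assumed from the original code):

import pickle

def savePlayerQueueToDisk(saveCopy):
    for i, chunk in enumerate(chunks(saveCopy, 100)):     # chunks() as in the original
        with open(CACHED_PLAYERS_DIR / str(i), 'wb') as filehandle:
            pickle.dump(chunk, filehandle)                # dump the 100-item chunk, not saveCopy
        print(f"saved chunk {i + 1}")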

multiprocessing - calling function with different input files

I have a function which reads in a file, compares a record in that file to a record in another file and depending on a rule, appends a record from the file to one of two lists.
I have an empty list for adding matched results to:
match = []
I have a list restrictions that I want to compare records in a series of files with.
I have a function for reading in a file and checking whether it contains any matches. If there is a match, I append the record to the match list.
def link_match(file):
    links = json.load(file)
    for link in links:
        found = False
        try:
            for other_link in other_links:
                if link['data'] == other_link['data']:
                    match.append(link)
                    found = True
                else:
                    pass
        except KeyError:  # assumed; the matching except clause was missing from the post
            pass
        if not found:
            print "not found"
I have numerous files that I wish to compare and I thus wish to use the multiprocessing library.
I create a list of file names to act as function arguments:
list_files=[]
for file in glob.glob("/path/*.json"):
    list_files.append(file)
I then use the map feature to call the function with the different input files:
if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=6)
    pool.map(link_match,list_files)
    pool.close()
    pool.join()
CPU use goes through the roof and by adding in a print line to the function loop I can see that matches are being found and the function is behaving correctly.
However, the match results list remains empty. What am I doing wrong?
multiprocessing runs a new instance of Python for each process in the pool - the context is empty (if you use spawn as a start method) or copied (if you use fork), plus copies of any arguments you pass in (either way), and from there they're all separate. If you want to pass data back out of the worker processes, there are a few other ways to do it.
Instead of writing to an internal list, write to a file and read from it later when you're done. The largest potential problem here is that only one thing can write to a file at a time, so either you make a lot of separate files (and have to read all of them afterwards) or they all block each other.
Continue with multiprocessing, but use a multiprocessing.Queue instead of a list. This is an object provided specifically for your current use-case: Using multiple processes and needing to pass data between them. Assuming that you should indeed be using multiprocessing (that your situation wouldn't be better for threading, see below), this is probably your best option.
Instead of multiprocessing, use threading. Separate threads all share a single environment. The biggest problem here is that Python only lets one thread actually run Python code at a time, per process. This is called the Global Interpreter Lock (GIL). threading is thus useful when the threads will be waiting on external processes (other programs, user input, reading or writing files), but if most of the time is spent in Python code, it actually takes longer (because it takes a little time to switch threads, and you're not doing anything to save time). Threading has its own thread-safe queue (Queue.Queue); you should probably use that rather than a plain list if you use threading - otherwise there's the potential that two threads accessing the list at the same time interfere with each other, if execution switches threads at the wrong time.
Oh, by the way: If you do use threading, Python 3.2 and later has an improved implementation of the GIL, which seems like it at least has a good chance of helping. A lot of stuff for threading performance is very dependent on your hardware (number of CPU cores) and the exact tasks you're doing, though - probably best to try several ways and see what works for you.
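For illustration, a minimal sketch of the multiprocessing.Queue option mentioned above, assuming the json data, other_links and list_files from the question; the sentinel-and-drain pattern is just one way to collect the results:

import json
from multiprocessing import Process, Queue

def link_match_worker(queue, filename):
    # same matching logic as link_match(), but results go onto the queue
    with open(filename) as f:
        links = json.load(f)
    for link in links:
        for other_link in other_links:
            if link['data'] == other_link['data']:
                queue.put(link)
    queue.put(None)  # sentinel: this worker is finished

if __name__ == '__main__':
    queue = Queue()
    workers = [Process(target=link_match_worker, args=(queue, f)) for f in list_files]
    for w in workers:
        w.start()

    matches, finished = [], 0
    while finished < len(workers):   # drain until every worker has sent its sentinel
        item = queue.get()
        if item is None:
            finished += 1
        else:
            matches.append(item)
    for w in workers:
        w.join()

Draining the queue before joining the workers matters: a worker can block on queue.put() if nothing is consuming while you wait on join().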
When multiprocessing, each subprocess gets its own copy of any global variables in the main module defined before the if __name__ == '__main__': statement. This means that the link_match() function in each one of the processes will be accessing a different match list in your code.
One workaround is to use a shared list, which in turn requires a SyncManager to synchronize access to the shared resource among the processes (which is created by calling multiprocessing.Manager()). This is then used to create the list to store the results (which I have named matches instead of match) in the code below.
I also had to use functools.partial() to create a single argument callable out of the revised link_match function which now takes two arguments, not one (which is the kind of function pool.map() expects).
from functools import partial
import glob
import json
import multiprocessing

def link_match(matches, file):  # note: added results list argument
    links = json.load(file)
    for link in links:
        found = False
        try:
            for other_link in other_links:
                if link['data'] == other_link['data']:
                    matches.append(link)
                    found = True
                else:
                    pass
        except KeyError:  # assumed, as in the question's version
            pass
        if not found:
            print "not found"

if __name__ == '__main__':
    manager = multiprocessing.Manager()          # create SyncManager
    matches = manager.list()                     # create a shared list here
    link_matches = partial(link_match, matches)  # create one-arg callable to pass to pool.map()
    pool = multiprocessing.Pool(processes=6)
    list_files = glob.glob("/path/*.json")       # only used here
    pool.map(link_matches, list_files)           # apply partial to files list
    pool.close()
    pool.join()
    print(matches)
Multiprocessing creates multiple processes. The context of your "match" variable will now be in that child process, not the parent Python process that kicked the processing off.
Try writing the list results out to a file in your function to see what I mean.
To expand cthrall's answer, you need to return something from your function in order to pass the info back to your main thread, e.g.
def link_match(file):
    # [put all the code here]
    return match

# [main thread]
all_matches = pool.map(link_match, list_files)
The list match will be returned from each worker, and map will return a list of lists in this case. You can then flatten it again to get the final output, as in the sketch below.
Alternatively you can use a shared list but this will just add more headache in my opinion.
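For example, flattening the list of lists returned by pool.map() might look like this (link_match and list_files as in the question):

from itertools import chain

all_matches = pool.map(link_match, list_files)          # one list of matches per file
flat_matches = list(chain.from_iterable(all_matches))   # single flat list of matches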

Python multithreading without a queue working with large data sets

I am running through a csv file of about 800k rows. I need a threading solution that runs through each row and hands it to a worker, with 32 threads at a time. I want to do this without a queue: it looks like the current Python threading solution with a queue is eating up a lot of memory.
Basically I want to read a csv file row and put it into a worker thread, and I only want 32 threads running at a time.
This is the current script. It appears that it is reading the entire csv file into a queue and then doing a queue.join(). Is it correct that it is loading the entire csv into a queue and then spawning the threads?
queue=Queue.Queue()

def worker():
    while True:
        task=queue.get()
        try:
            subprocess.call(['php {docRoot}/cli.php -u "api/email/ses" -r "{task}"'.format(
                docRoot=docRoot,
                task=task
            )],shell=True)
        except:
            pass
        with lock:
            stats['done']+=1
            if int(time.time())!=stats.get('now'):
                stats.update(
                    now=int(time.time()),
                    percent=(stats.get('done')/stats.get('total'))*100,
                    ps=(stats.get('done')/(time.time()-stats.get('start')))
                )
                print("\r {percent:.1f}% [{progress:24}] {persec:.3f}/s ({done}/{total}) ETA {eta:<12}".format(
                    percent=stats.get('percent'),
                    progress=('='*int((23*stats.get('percent'))/100))+'>',
                    persec=stats.get('ps'),
                    done=int(stats.get('done')),
                    total=stats.get('total'),
                    eta=snippets.duration.time(int((stats.get('total')-stats.get('done'))/stats.get('ps')))
                ),end='')
        queue.task_done()

for i in range(32):
    workers=threading.Thread(target=worker)
    workers.daemon=True
    workers.start()

try:
    with open(csvFile,'rb') as fh:
        try:
            dialect=csv.Sniffer().sniff(fh.readline(),[',',';'])
            fh.seek(0)
            reader=csv.reader(fh,dialect)
            headers=reader.next()
        except csv.Error as e:
            print("\rERROR[CSV] {error}\n".format(error=e))
        else:
            while True:
                try:
                    data=reader.next()
                except csv.Error as e:
                    print("\rERROR[CSV] - Line {line}: {error}\n".format(line=reader.line_num, error=e))
                except StopIteration:
                    break
                else:
                    stats['total']+=1
                    queue.put(urllib.urlencode(dict(zip(headers,data)+dict(campaign=row.get('Campaign')).items())))
except IOError as e:  # assumed handler; the outer try's except clause was not shown in the post
    print("\rERROR[FILE] {error}\n".format(error=e))

queue.join()
32 threads is probably overkill unless you have some humungous hardware available.
The rule of thumb for optimum number of threads or processes is: (no. of cores * 2) - 1
which comes to either 7 or 15 on most hardware.
The simplest way would be to start 7 threads, passing each thread an "offset" as a parameter - i.e. a number from 0 to 6.
Each thread would then skip rows until it reached the "offset" number and process that row. Having processed the row it can skip 6 rows and process the 7th -- repeat until no more rows.
This setup works for threads and multiple processes and is very efficient in I/O on most machines as all the threads should be reading roughly the same part of the file at any given time.
I should add that this method is particularly good for python as each thread is more or less independent once started and avoids the dreaded python global lock common to other methods.
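A minimal sketch of that offset scheme, assuming the csvFile name from the question and a hypothetical process_row() for the per-row work:

import csv
import threading

NUM_WORKERS = 7

def worker(offset):
    with open(csvFile, 'rb') as fh:             # 'rb' to match the question's Python 2 csv usage
        reader = csv.reader(fh)
        headers = next(reader)
        for row_no, row in enumerate(reader):
            if row_no % NUM_WORKERS == offset:  # each thread handles every 7th row
                process_row(headers, row)       # hypothetical per-row work

threads = [threading.Thread(target=worker, args=(n,)) for n in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()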
I don't understand why you want to spawn 32 threads per row. However, data processing in parallel is a fairly common, embarrassingly parallel thing to do and is easily achievable with Python's multiprocessing library.
Example:
from multiprocessing import Pool

def job(args):
    # do some work
    ...

inputs = [...]  # define your inputs
Pool().map(job, inputs)
I leave it up to you to fill in the blanks to meet your specific requirements.
See: https://bitbucket.org/ccaih/ccav/src/tip/bin/ for many examples of this pattern.
Other answers have explained how to use Pool without having to manage queues (it manages them for you) and that you do not want to set the number of processes to 32, but to your CPU count - 1. I would add two things. First, you may want to look at the pandas package, which can easily import your csv file into Python. The second is that the examples of using Pool in the other answers only pass it a function that takes a single argument. Unfortunately, you can only pass Pool a single object with all the inputs for your function, which makes it difficult to use functions that take multiple arguments. Here is code that allows you to call a previously defined function with multiple arguments using pool:
import multiprocessing
from multiprocessing import Pool

def multiplyxy(x,y):
    return x*y

def funkytuple(t):
    """
    Breaks a tuple into a function to be called and a tuple
    of arguments for that function. Changes that new tuple into
    a series of arguments and passes those arguments to the
    function.
    """
    f = t[0]
    t = t[1]
    return f(*t)

def processparallel(func, arglist):
    """
    Takes a function and a list of arguments for that function
    and processes in parallel.
    """
    parallelarglist = []
    for entry in arglist:
        parallelarglist.append((func, tuple(entry)))
    cpu_count = int(multiprocessing.cpu_count() - 1)
    pool = Pool(processes=cpu_count)
    database = pool.map(funkytuple, parallelarglist)
    pool.close()
    return database

# Necessary on Windows
if __name__ == '__main__':
    x = [23, 23, 42, 3254, 32]
    y = [324, 234, 12, 425, 13]
    i = 0
    arglist = []
    while i < len(x):
        arglist.append([x[i],y[i]])
        i += 1
    database = processparallel(multiplyxy, arglist)
    print(database)
Your question is pretty unclear. Have you tried initializing your Queue to have a maximum size of, say, 64?
myq = Queue.Queue(maxsize=64)
Then a producer (one or more) trying to .put() new items on myq will block until consumers reduce the queue size to less than 64. This will correspondingly limit the amount of memory consumed by the queue. By default, queues are unbounded: if the producer(s) add items faster than consumers take them off, the queue can grow to consume all the RAM you have.
EDIT
This is the current script. It appears that it is reading the entire csv file into a queue and then doing a queue.join(). Is it correct that it is loading the entire csv into a queue and then spawning the threads?
The indentation is messed up in your post, so I have to guess some, but:
The code obviously starts 32 threads before it opens the CSV file.
You didn't show the code that creates the queue. As already explained above, if it's a Queue.Queue, by default it's unbounded, and can grow to any size if your main loop puts items on it faster than your threads remove items from it. Since you haven't said anything about what worker() does (or shown its code), we don't have enough information to guess whether that's the case. But the fact that memory use is out of hand suggests that's the case.
And, as also explained, you can stop that easily by specifying a maximum size when you create the queue.
To get better answers, supply better info ;-)
ANOTHER EDIT
Well, the indentation is still messed up in spots, but it's better. Have you tried any suggestions? Looks like your worker threads each spawn a new process, so they'll take very much longer than it takes just to read another line from the csv file. So it's indeed very likely that you put items on the queue far faster than they're taken off. So, for the umpteenth time ;-), TRY initializing the queue with (say) maxsize=64. Then reveal what happens.
BTW, the bare except: clause in worker() is a Really Bad Idea. If anything goes wrong, you'll never know. If you have to ignore every possible exception (including even KeyboardInterrupt and SystemExit), at least log the exception info.
And note what @JamesAnderson said: unless you have extraordinary hardware resources, trying to run 32 processes at a time is almost certainly slower than running a number of processes that's no more than twice the number of available cores. Then again, that also depends a lot on what your PHP program does. If, for example, the PHP program uses disk I/O heavily, any multiprocessing may be slower than none.
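Putting those last two points together, the worker could look something like this sketch: a bounded queue plus logged exceptions instead of a bare except (docRoot is assumed from the question, and the stats bookkeeping is omitted):

import logging
import subprocess
import Queue  # `queue` on Python 3

queue = Queue.Queue(maxsize=64)   # producers block once 64 tasks are waiting

def worker():
    while True:
        task = queue.get()
        try:
            subprocess.call(['php', '{0}/cli.php'.format(docRoot),
                             '-u', 'api/email/ses', '-r', task])
        except Exception:
            # record what went wrong instead of silently swallowing everything
            logging.exception("task failed: %r", task)
        finally:
            queue.task_done()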

Python Threading stdin/stdout

I have a file which contains a lot of data. Each row is a record. And I am trying to do some ETL work against the whole file. Right now I am using standard input to read the data line by line, because it won't load the whole file into memory. The cool thing about this is that the script is very flexible to integrate with other scripts and shell commands. I write the result to standard output. For example:
$ cat input_file
line1
line2
line3
line4
...
My current python code looks like this - parse.py
import sys

for line in sys.stdin:
    result = ETL(line)  # ETL is some self defined function which takes a while to execute.
    print result
The code below is how it is working right now:
cat input_file | python parse.py > output_file
I have looked at the Threading module of Python and I am wondering if the performance would be dramatically improved if I use that module.
Question1: How should I plan the quotas for each thread, why?
...
counter = 0
buffer = []
for line in sys.stdin:
    buffer.append(line)
    counter += 1
    if counter % 5 == 0:  # maybe assign 5 rows to each thread? if not, is there a rule of thumb to determine
        counter = 0
        thread = parser(buffer)
        buffer = []
        thread.start()
Question2: Multiple Threads might print the result back to stdout at the same time, how to organize them and avoid the situation below?
import threading
import time

class parser(threading.Thread):
    def __init__(self, data_input):
        threading.Thread.__init__(self)
        self.data_input = data_input

    def run(self):
        for elem in self.data_input:
            time.sleep(3)
            print elem + 'Finished'

work = ['a', 'b', 'c', 'd', 'e', 'f']

thread1 = parser(['a', 'b'])
thread2 = parser(['c', 'd'])
thread3 = parser(['e', 'f'])

thread1.start()
thread2.start()
thread3.start()
The output is really ugly, where one row contains the outputs from two threads.
aFinished
cFinishedeFinished
bFinished
fFinished
dFinished
Taking your second question first, this is what mutexes are for. You can get the cleaner output that you want by using a lock to coordinate among the parsers and ensure that only one thread has access to the output stream during a given period of time:
class parser(threading.Thread):
    output_lock = threading.Lock()

    def __init__(self, data_input):
        threading.Thread.__init__(self)
        self.data_input = data_input

    def run(self):
        for elem in self.data_input:
            time.sleep(3)
            with self.output_lock:
                print elem + 'Finished'
As regards your first question, note that it's probably the case that multi-threading will provide no benefit for your particular workload. It largely depends on whether the work you do with each input line (your ETL function) is primarily CPU-bound or IO-bound. If the former (which I suspect is likely), threads will be of no help, because of the global interpreter lock. In that case, you would want to use the multiprocessing module to distribute work among multiple processes instead of multiple threads.
But you can get the same results with an easier to implement workflow: Split the input file into n pieces (using, e.g., the split command); invoke the extract-and-transform script separately on each subfile; then concatenate the resulting output files.
One nitpick: "using standard input to read the data line by line because it won't load the whole file into memory" involves a misconception. You can read a file line by line from within Python by, e.g., replacing sys.stdin with a file object in a construct like:
for line in sys.stdin:
See also the readline() method of file objects, and note that read() can take as parameter the maximum number of bytes to read.
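For instance, assuming the input lives in input_file, the same loop works directly on a file object:

with open('input_file') as f:
    for line in f:                # reads one line at a time, not the whole file
        result = ETL(line)        # ETL as defined in the question
        print(result)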
Whether threading will be helpful to you is highly dependent on your situation. In particular, if your ETL() function involves a lot of disk access, then threading would likely give you a pretty significant speed improvement.
In response to your first question, I've always found that it just depends. There are a lot of factors at play when determining the ideal number of threads, and many of them are program-dependent. If you're doing a lot of disk access (which is pretty slow), for example, then you'll want more threads to take advantage of the downtime while waiting for disk access. If the program is CPU-bound, though, tons of threads may not be super helpful. So, while it may be possible to analyze all the factors to come up with an ideal number of threads, it's usually a lot faster to make an initial guess and then adjust from there.
More specifically, though, assigning a certain number of lines to each thread probably isn't the best way to go about divvying up the work. Consider, for example, if one line takes a particularly long time to process. It would be best if one thread could work away at that one line and the other threads could each do a few more lines in the meantime. The best way to handle this is to use a Queue. If you push each line into a Queue, then each thread can pull a line off the Queue, handle it, and repeat until the Queue is empty. This way, the work gets distributed such that no thread is ever without work to do (until the end, of course).
Now, the second question. You're definitely right that writing to stdout from multiple threads at once isn't an ideal solution. Ideally, you would arrange things so that the writing to stdout happens in only one place. One great way to do that is to use a Queue. If you have each thread write its output to a shared Queue, then you can spawn an additional thread whose sole task is to pull items out of that Queue and print them to stdout. By restricting the printing to just one thread, you'll avoid the issues inherent in multiple threads trying to print at once.
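A sketch of that arrangement: an input queue feeding the workers plus a single printer thread that owns stdout. ETL() is the question's function; the thread count and sentinel handling here are illustrative choices:

import sys
import threading
import Queue  # `queue` on Python 3

in_q = Queue.Queue(maxsize=1000)   # bounded so stdin reading can't race far ahead
out_q = Queue.Queue()
SENTINEL = object()
NUM_WORKERS = 4

def worker():
    while True:
        line = in_q.get()
        if line is SENTINEL:
            break
        out_q.put(ETL(line))       # ETL as defined in the question

def printer():
    # the only place that writes to stdout, so output lines never interleave
    while True:
        result = out_q.get()
        if result is SENTINEL:
            break
        sys.stdout.write(result + '\n')

workers = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
printer_thread = threading.Thread(target=printer)
for t in workers:
    t.start()
printer_thread.start()

for line in sys.stdin:
    in_q.put(line)
for _ in workers:
    in_q.put(SENTINEL)             # one stop marker per worker
for t in workers:
    t.join()
out_q.put(SENTINEL)                # all results are queued; stop the printer
printer_thread.join()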
