Why are `pathlib.Path("xxx/yyy").unlink / mkdir / rmdir` not synchronous operations? - python

I am using Python's pathlib.Path module together with PyTorch's DistributedDataParallel.
When running multiple processes with DistributedDataParallel, I delete or create a file with Path("xxx/yyy").rmdir / .unlink / .mkdir only in process 0, i.e. the process with local_rank 0.
Then something weird happens: the Path("xxx/yyy").rmdir / .unlink / .mkdir operations do not seem to be synchronous, by which I mean that rank 0 keeps running without waiting for the file operation to finish. So if there is a file check after that operation, for example Path("xxx/yyy").parent.iterdir(), the results on rank 0 and rank 1 (or other ranks) are not equal: rank 0 finds (or does not find) files that the other ranks do not find (or do find).
The problem has been fixed by adding a synchronization wait (a simple polling loop); the code is as follows:
from time import sleep

def wait_to_success(fn, no=False):
    while True:
        sleep(0.01)
        if (no and not fn()) or (not no and fn()):
            break
This function is used as follows:
remove_file_operation(filepath)
wait_to_success(filepath.exists, no=True)
or
create_file_operation(filepath)
wait_to_success(filepath.exists, no=False)
Note that calling torch.distributed.barrier() does not fix this problem.
So I am very confused: why does the pathlib.Path operation not wait until the file operation ends?
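For reference, here is a minimal sketch of the rank-0-only pattern described above, assuming torch.distributed is already initialized and that wait_to_success is defined as in the snippet (the directory name is just a placeholder):

import torch.distributed as dist
from pathlib import Path

ckpt_dir = Path("xxx/yyy")

if dist.get_rank() == 0:
    # only rank 0 touches the filesystem
    ckpt_dir.mkdir(parents=True, exist_ok=True)

# every rank polls until it can actually see the directory,
# since barrier() alone reportedly did not help here
wait_to_success(ckpt_dir.exists, no=False)
dist.barrier()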

Related

Multiprocessing pool for function with no arguments/iterable?

I'm running Python 2.7 on the GCE platform to do calculations. The GCE instances boot, install various packages, copy 80 Gb of data from a storage bucket and run a "workermaster.py" script with nohup. The workermaster runs in an infinite loop which checks a task-queue bucket for tasks. When the task bucket isn't empty it picks a random file (task) and passes the work to a calculation module. If there is nothing to do, the workermaster sleeps for a number of seconds and checks the task-list again. The workermaster runs continuously until the instance is terminated (or something breaks!).
Currently this works quite well, but my problem is that my code only runs instances with a single CPU. If I want to scale up calculations I have to create many identical single-CPU instances and this means there is a large cost overhead for creating many 80 Gb disks and transferring the data to them each time, even though the calculation is only "reading" one small portion of the data for any particular calculation. I want to make everything more efficient and cost effective by making my workermaster capable of using multiple CPUs, but after reading many tutorials and other questions on SO I'm completely confused.
I thought I could just turn the important part of my workermaster code into a function, and then create a pool of processes that "call" it using the multiprocessing module. Once the workermaster loop is running on each CPU, the processes do not need to interact with each other or depend on each other in any way, they just happen to be running on the same instance. The workermaster prints out information about where it is in the calculation and I'm also confused about how it will be possible to tell the "print" statements from each process apart, but I guess that's a few steps from where I am now! My problems/confusion are that:
1) My workermaster "def" doesn't return any value, because it just starts an infinite loop, whereas every web example seems to have something in the format myresult = pool.map(.....); and
2) My workermaster "def" doesn't need any arguments/inputs - it just runs, whereas the examples of multiprocessing that I have seen on SO and on the Python Docs seem to have iterables.
In case it is important, the simplified version of the workermaster code is:
# module imports are here
# filepath definitions go here

def workermaster():
    while True:
        tasklist = cloudstoragefunctions.getbucketfiles('<my-task-queue-bucket>')
        if tasklist:
            tasknumber = random.randint(2, len(tasklist))
            assignedtask = tasklist[tasknumber]
            print 'Assigned task is now: ' + assignedtask
            subprocess.call('gsutil -q cp gs://<my-task-queue-bucket>/' + assignedtask + ' "' + taskfilepath + assignedtask + '"', shell=True)
            tasktype = assignedtask.split('#')[0]
            if tasktype == 'Calculation':
                currentcalcid = assignedtask.split('#')[1]
                currentfilenumber = assignedtask.split('#')[2].replace('part', '')
                currentstartfile = assignedtask.split('#')[3]
                currentendfile = assignedtask.split('#')[4].replace('.csv', '')
                calcmodule.docalc(currentcalcid, currentfilenumber, currentstartfile, currentendfile)
            elif tasktype == 'Analysis':
                # set up and run analysis module, etc.
                pass
            print ' Operation completed!'
            os.remove(taskfilepath + assignedtask)
        else:
            print 'There are no tasks to be processed. Going to sleep...'
            time.sleep(30)
I'm trying to "call" the function multiple times using the multiprocessing module. I think I need to use the "pool" method, so I've tried this:
import multiprocessing

if __name__ == "__main__":
    p = multiprocessing.Pool()
    pool_output = p.map(workermaster, [])
My understanding from the docs is that the __name__ line is there only as a workaround for doing multiprocessing on Windows (which I am using for development, but GCE is on Linux). The p = multiprocessing.Pool() line creates a pool of workers equal to the number of system CPUs, since no argument is specified. If the number of CPUs were 1 then I would expect the code to behave as it does before I attempted to use multiprocessing. The last line is the one that I don't understand. I thought that it was telling each of the processors in the pool that the "target" (thing to run) is workermaster. From the docs there appears to be a compulsory argument which is an iterable, but I don't really understand what this is in my case, as workermaster doesn't take any arguments. I've tried passing it an empty list, empty string, empty brackets (tuple?) and it doesn't do anything.
Please would it be possible for someone to help me out? There are lots of discussions about using multiprocessing, and this thread Mulitprocess Pools with different functions and this one python code with mulitprocessing only spawns one process each time seem to be close to what I am doing, but still have iterables as arguments. If there is anything critical that I have left out please advise and I will modify my post - thank you to anyone who can help!
Pool() is useful if you want to run the same function with different arguments.
If you want to run a function only once then use a normal Process().
If you want to run the same function 2 times then you can manually create 2 Process()es.
If you want to use Pool() to run the function 2 times, then pass it a list with 2 arguments (even if you don't need the arguments), because that is how Pool() knows to run it 2 times.
But if you run the function 2 times against the same folder then it may run the same task 2 times; if you run it 5 times then it may run the same task 5 times. I don't know if that is what you need.
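For illustration, here is a minimal sketch of both options; the workermaster below is a stand-in for the real one, and its dummy parameter exists only so that Pool.map() has something to pass:

from multiprocessing import Pool, Process

def workermaster(dummy=None):
    # dummy only exists so Pool.map() can hand this function an item
    print('worker started')

if __name__ == '__main__':
    # Option 1: start two independent Process() objects by hand
    procs = [Process(target=workermaster) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    # Option 2: have Pool() run the function twice by giving map()
    # a two-element iterable of throwaway arguments
    pool = Pool(2)
    pool.map(workermaster, [None, None])
    pool.close()
    pool.join()

Note that if the real workermaster loops forever, Pool.map() will never return, so manually created Process() objects are usually the better fit for that case.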
As for Ctrl+C, I found Catch Ctrl+C / SIGINT and exit multiprocesses gracefully in python on Stack Overflow, but I don't know if it resolves your problem.

How to have a multi-processing function return and store values in python?

I have a function which I will run using multi-processing. However the function returns a value and I do not know how to store that value once it's done.
I read somewhere online about using a queue but I don't know how to implement it or if that'd even work.
import os
from multiprocessing import Process

cores = []
for i in range(os.cpu_count()):
    cores.append(Process(target=processImages, args=(dataSets[i],)))

for core in cores:
    core.start()

for core in cores:
    core.join()
Where the function 'processImages' returns a value. How do I save the returned value?
In your code fragment you have input dataSets which is a list of some unspecified size. You have a function processImages which takes a dataSet element and apparently returns a value you want to capture.
cpu_count == dataset length ?
The first problem I notice is that os.cpu_count() drives the range of values i which then determines which datasets you process. I'm going to assume you would prefer these two things to be independent. That is, you want to be able to crunch some X number of datasets and you want it to work on any machine, having anywhere from 1 - 1000 (or more...) cores.
An aside about CPU-bound work
I'm also going to assume that you have already determined that the task really is CPU-bound, thus it makes sense to split by core. If, instead, your task is disk io-bound, you would want more workers. You could also be memory bound or cache bound. If optimal parallelization is important to you, you should consider doing some trials to see which number of workers really gives you maximum performance.
Here's more reading if you like
Pool class
Anyway, as mentioned by Michael Butscher, the Pool class simplifies this for you. Yours is a standard use case. You have a set of work to be done (your list of datasets to be processed) and a number of workers to do it (in your code fragment, your number of cores).
TLDR
Use those simple multiprocessing concepts like this:
from multiprocessing import Pool
import os
# Renaming this variable just for clarity of the example here
work_queue = datasets
# This is the number you might want to find experimentally. Or just run with cpu_count()
worker_count = os.cpu_count()
# This will create processes (fork) and join all for you behind the scenes
worker_pool = Pool(worker_count)
# Farm out the work, gather the results. Does not care whether dataset count equals cpu count
processed_work = worker_pool.map(processImages, work_queue)
# Do something with the result
print(processed_work)
You cannot return a variable from another process. The recommended way is to create a Queue (multiprocessing.Queue), have your subprocess put the results on that queue, and read them back once it's done -- this works well if you have a lot of results.
If you just need a single number -- using Value or Array could be easier.
Just remember, you cannot use a plain variable for that; it has to be wrapped in the above-mentioned classes from the multiprocessing lib.
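For example, here is a minimal sketch of the Queue approach, reusing the question's processImages function and dataSets list:

from multiprocessing import Process, Queue

def worker(dataset, results):
    # put (input, output) pairs on the shared queue instead of returning
    results.put((dataset, processImages(dataset)))

if __name__ == '__main__':
    results = Queue()
    procs = [Process(target=worker, args=(ds, results)) for ds in dataSets]
    for p in procs:
        p.start()
    collected = [results.get() for _ in procs]   # drain the queue before joining
    for p in procs:
        p.join()
    print(collected)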
If you want to use the result object returned by a multiprocessing worker, try this:
from multiprocessing.pool import ThreadPool

def fun(fun_argument1, ... , fun_argumentn):
    <blabla>
    return object_1, object_2

pool = ThreadPool(processes=number_of_your_process)
async_num1 = pool.apply_async(fun, (fun_argument1, ... , fun_argumentn))
object_1, object_2 = async_num1.get()
then you can do whatever you want.

Asynchronous file IO in multi-threaded application with variable index

In one of my programs, I have the following method :
def set_part_content(self, part_no, block_no, data):
    with open(self.file_path, "rwb+") as f:
        f.seek(part_no * self.constant1 + block_no * self.constant2)
        f.write(data)
I did it this way because of the following:
I have to write at different indexes (which is why the f.seek is there),
and this function is thread safe (thanks to the with statement).
My issue is that this function is called approximately 10k to 100k times, and it is really, really slow (it represents half of the execution time of one of my most critical sets of functionality) because of the opening/closing time.
Because of the f.seek, I can't open the file directly in the __init__ function and operate on it there (if 2 threads use the function at the same time, it results in a bad index for one of the two, which is critical).
Is there any module / way that could accelerate this function?
I am not familiar with Python, so I am not sure whether the with statement makes this function thread safe or not.
If it does, you won't have 2 threads using the function at the same time.
If it doesn't, your function is not thread safe and you need a lock to ensure thread safety. You may check implementing file locks using Python's with statement. Another option is to make the function async and let a single thread handle all the writing.
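As a rough sketch of that lock-based option (the PartWriter class name is illustrative; its attributes mirror the question's code): open the file once, then serialize each seek+write with a threading.Lock:

import threading

class PartWriter(object):
    def __init__(self, file_path, constant1, constant2):
        self.f = open(file_path, "r+b")   # open once; the file must already exist
        self.constant1 = constant1
        self.constant2 = constant2
        self._lock = threading.Lock()

    def set_part_content(self, part_no, block_no, data):
        with self._lock:                  # seek+write becomes atomic across threads
            self.f.seek(part_no * self.constant1 + block_no * self.constant2)
            self.f.write(data)

    def close(self):
        self.f.close()

This avoids paying the open/close cost on every call while still keeping the index computation safe against interleaving threads.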

Ensuring that my program is not doing a concurrent file write

I am writing a script that is required to perform safe writes to any given file, i.e. append to a file only if no other process is known to be writing into it. My understanding of the theory was that concurrent writes are prevented by write locks on the file system, but it seems that is not the case in practice.
Here's how I set up my test case:
I am redirecting the output of a ping command:
ping 127.0.0.1 > fileForSafeWrites.txt
On the other end, I have the following python code attempting to write to the file:
handle = open('fileForSafeWrites.txt', 'w')
handle.write("Probing for opportunity to write")
handle.close()
Running both processes concurrently, they complete gracefully. I see that fileForSafeWrites.txt has turned into a file with binary content, instead of the first process holding a write lock that protects it from being written to by the Python code.
How do I force either or both of my concurrent processes not to interfere with each other? I have read people advising that the ability to obtain a writable file handle is evidence that the file is safe to write to, such as in https://stackoverflow.com/a/3070749/1309045
Is this behavior specific to my operating system and Python? I use Python 2.7 in an Ubuntu 12.04 environment.
Use the lockfile module as shown in Locking a file in Python
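A minimal sketch of that suggestion, assuming the third-party lockfile package is installed (pip install lockfile); note that both writers must cooperate by acquiring the same lock, so it will not stop a process (such as the shell redirection above) that ignores it:

from lockfile import FileLock

with FileLock('fileForSafeWrites.txt'):   # acquires a companion lock file
    with open('fileForSafeWrites.txt', 'a') as handle:
        handle.write("Text written while holding the lock\n")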
Inspired by a solution described for concurrency checks, I came up with the following snippet of code. It works if one can appropriately predict the frequency at which the file in question is written. The solution uses file modification times.
import os
import time

def isFileBeingWrittenInto(filename,
                           writeFrequency=180, overheadTimePercentage=20):
    '''Find if a file was modified in the last x seconds given by writeFrequency.'''
    overhead = 1 + float(overheadTimePercentage) / 100   # Add some buffer time
    maxWriteFrequency = writeFrequency * overhead
    modifiedTimeStart = os.stat(filename).st_mtime       # Time file was last modified
    time.sleep(writeFrequency)                           # wait writeFrequency # of secs
    modifiedTimeEnd = os.stat(filename).st_mtime         # File modification time again
    if 0 < (modifiedTimeEnd - modifiedTimeStart) <= maxWriteFrequency:
        return True
    else:
        return False

if not isFileBeingWrittenInto('fileForSafeWrites.txt'):
    handle = open('fileForSafeWrites.txt', 'a')
    handle.write("Text written safely when no one else is writing to the file")
    handle.close()
This does not do true concurrency checks but can be combined with a variety of other methods for practical purposes to safely write into a file without having to worry about garbled text. Hope it helps the next person searching for a way to do this.
EDIT UPDATE:
Upon further testing, I encountered a high-frequency write process that required the conditional logic to be modified from
if 0 < (modifiedTimeEnd - modifiedTimeStart) < maxWriteFrequency
to
if 0 < (modifiedTimeEnd - modifiedTimeStart) <= maxWriteFrequency
That makes a better answer, in theory and in practice.

Python multithreading without a queue working with large data sets

I am running through a csv file of about 800k rows. I need a threading solution that runs through each row and spawns 32 threads at a time into a worker. I want to do this without a queue. It looks like the current Python threading solution with a queue is eating up a lot of memory.
Basically I want to read a csv file row and hand it to a worker thread, with only 32 threads running at a time.
This is the current script. It appears that it is reading the entire csv file into a queue and then doing a queue.join(). Is it correct that it is loading the entire csv into a queue and then spawning the threads?
queue = Queue.Queue()

def worker():
    while True:
        task = queue.get()
        try:
            subprocess.call(['php {docRoot}/cli.php -u "api/email/ses" -r "{task}"'.format(
                docRoot=docRoot,
                task=task
            )], shell=True)
        except:
            pass
        with lock:
            stats['done'] += 1
            if int(time.time()) != stats.get('now'):
                stats.update(
                    now=int(time.time()),
                    percent=(stats.get('done')/stats.get('total'))*100,
                    ps=(stats.get('done')/(time.time()-stats.get('start')))
                )
                print("\r {percent:.1f}% [{progress:24}] {persec:.3f}/s ({done}/{total}) ETA {eta:<12}".format(
                    percent=stats.get('percent'),
                    progress=('='*int((23*stats.get('percent'))/100))+'>',
                    persec=stats.get('ps'),
                    done=int(stats.get('done')),
                    total=stats.get('total'),
                    eta=snippets.duration.time(int((stats.get('total')-stats.get('done'))/stats.get('ps')))
                ), end='')
        queue.task_done()

for i in range(32):
    workers = threading.Thread(target=worker)
    workers.daemon = True
    workers.start()

try:
    with open(csvFile, 'rb') as fh:
        try:
            dialect = csv.Sniffer().sniff(fh.readline(), [',', ';'])
            fh.seek(0)
            reader = csv.reader(fh, dialect)
            headers = reader.next()
        except csv.Error as e:
            print("\rERROR[CSV] {error}\n".format(error=e))
        else:
            while True:
                try:
                    data = reader.next()
                except csv.Error as e:
                    print("\rERROR[CSV] - Line {line}: {error}\n".format(line=reader.line_num, error=e))
                except StopIteration:
                    break
                else:
                    stats['total'] += 1
                    queue.put(urllib.urlencode(dict(zip(headers, data) + dict(campaign=row.get('Campaign')).items())))
    queue.join()
32 threads is probably overkill unless you have some humungous hardware available.
The rule of thumb for the optimum number of threads or processes is: (no. of cores * 2) - 1,
which comes to either 7 or 15 on most hardware.
The simplest way would be to start 7 threads, passing each thread an "offset" as a parameter,
i.e. a number from 0 to 6.
Each thread would then skip rows until it reached the "offset" number and process that row. Having processed the row it can skip 6 rows and process the 7th -- repeat until no more rows.
This setup works for threads and multiple processes and is very efficient in I/O on most machines as all the threads should be reading roughly the same part of the file at any given time.
I should add that this method is particularly good for python as each thread is more or less independent once started and avoids the dreaded python global lock common to other methods.
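A minimal sketch of that offset/stride idea using multiprocessing (worker_count, process_row and the file name are illustrative placeholders):

import csv
from multiprocessing import Process

def stride_worker(offset, worker_count, csv_path):
    with open(csv_path, 'r') as fh:
        reader = csv.reader(fh)
        next(reader)                        # skip the header row
        for i, row in enumerate(reader):
            if i % worker_count == offset:  # each worker only handles its own rows
                process_row(row)            # your per-row work goes here

if __name__ == '__main__':
    worker_count = 7
    procs = [Process(target=stride_worker, args=(n, worker_count, 'data.csv'))
             for n in range(worker_count)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()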
I don't understand why you want to spawn 32 threads per row. However, processing data in parallel is a fairly common, embarrassingly parallel thing to do and is easily achievable with Python's multiprocessing library.
Example:
from multiprocessing import Pool

def job(args):
    # do some work
    pass

inputs = [...]  # define your inputs

Pool().map(job, inputs)
I leave it up to you to fill in the blanks to meet your specific requirements.
See: https://bitbucket.org/ccaih/ccav/src/tip/bin/ for many examples of this pattern.
Other answers have explained how to use Pool without having to manage queues (it manages them for you) and that you do not want to set the number of processes to 32, but to your CPU count - 1. I would add two things. First, you may want to look at the pandas package, which can easily import your csv file into Python. The second is that the examples of using Pool in the other answers only pass it a function that takes a single argument. Unfortunately, you can only pass Pool a single object with all the inputs for your function, which makes it difficult to use functions that take multiple arguments. Here is code that allows you to call a previously defined function with multiple arguments using pool:
import multiprocessing
from multiprocessing import Pool

def multiplyxy(x, y):
    return x*y

def funkytuple(t):
    """
    Breaks a tuple into a function to be called and a tuple
    of arguments for that function. Changes that new tuple into
    a series of arguments and passes those arguments to the
    function.
    """
    f = t[0]
    t = t[1]
    return f(*t)

def processparallel(func, arglist):
    """
    Takes a function and a list of arguments for that function
    and processes in parallel.
    """
    parallelarglist = []
    for entry in arglist:
        parallelarglist.append((func, tuple(entry)))
    cpu_count = int(multiprocessing.cpu_count() - 1)
    pool = Pool(processes=cpu_count)
    database = pool.map(funkytuple, parallelarglist)
    pool.close()
    return database

# Necessary on Windows
if __name__ == '__main__':
    x = [23, 23, 42, 3254, 32]
    y = [324, 234, 12, 425, 13]
    i = 0
    arglist = []
    while i < len(x):
        arglist.append([x[i], y[i]])
        i += 1
    database = processparallel(multiplyxy, arglist)
    print(database)
Your question is pretty unclear. Have you tried initializing your Queue to have a maximum size of, say, 64?
myq = Queue.Queue(maxsize=64)
Then a producer (one or more) trying to .put() new items on myq will block until consumers reduce the queue size to less than 64. This will correspondingly limit the amount of memory consumed by the queue. By default, queues are unbounded: if the producer(s) add items faster than consumers take them off, the queue can grow to consume all the RAM you have.
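A minimal sketch of that bounded-queue setup, written against Python 2's Queue module to match the question (the CSV file name and the per-item work are placeholders):

import threading, time, Queue

myq = Queue.Queue(maxsize=64)        # put() blocks once 64 items are waiting

def consumer():
    while True:
        item = myq.get()
        time.sleep(0.01)             # stand-in for the real per-item work
        myq.task_done()

for _ in range(32):
    t = threading.Thread(target=consumer)
    t.daemon = True
    t.start()

for line in open('data.csv'):        # producer: blocks whenever the queue is full
    myq.put(line)
myq.join()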
EDIT
This is current script. It appears that it is reading the
entire csv file into queue and doing a queue.join(). Is
it correct that it is loading the entire csv into a queue
then spawning the threads?
The indentation is messed up in your post, so I have to guess some, but:
The code obviously starts 32 threads before it opens the CSV file.
You didn't show the code that creates the queue. As already explained above, if it's a Queue.Queue, by default it's unbounded and can grow to any size if your main loop puts items on it faster than your threads remove them. Since you haven't said anything about what worker() does (or shown its code), we don't have enough information to guess whether that's the case. But the fact that memory use is out of hand suggests that it is.
And, as also explained, you can stop that easily by specifying a maximum size when you create the queue.
To get better answers, supply better info ;-)
ANOTHER EDIT
Well, the indentation is still messed up in spots, but it's better. Have you tried any suggestions? Looks like your worker threads each spawn a new process, so they'll take very much longer than it takes just to read another line from the csv file. So it's indeed very likely that you put items on the queue far faster than they're taken off. So, for the umpteenth time ;-), TRY initializing the queue with (say) maxsize=64. Then reveal what happens.
BTW, the bare except: clause in worker() is a Really Bad Idea. If anything goes wrong, you'll never know. If you have to ignore every possible exception (including even KeyboardInterrupt and SystemExit), at least log the exception info.
And note what @JamesAnderson said: unless you have extraordinary hardware resources, trying to run 32 processes at a time is almost certainly slower than running a number of processes that's no more than twice the number of available cores. Then again, that depends a lot on what your PHP program does. If, for example, the PHP program uses disk I/O heavily, any multiprocessing may be slower than none.
