python: write file once across concurrent independent invocations


I have a situation where multiple concurrent invocations of a Python script take place, each of which involves initializing some system info and loading it from a file.
This initialization should happen only once, and while it is happening the other invocations must wait somehow. Once it has happened, the other invocations must proceed with reading the file. However, since an unknown number of concurrent invocations of the program take place, the initialization section is entered multiple times, causing problems.
Here is my code:
#initialization has already happened, load info from file
if os.path.isfile("/tmp/corners.txt"):
    logging.info("corners exist, load'em up!")
    #load corners from cornersfile
    cornersfile=open("/tmp/corners.txt","r")
    for line in cornersfile:
        corners.append((line.split()[0], line.split()[1]))
    cornersfile.close()
    logging.info("corners is %s", corners)
else:
    # initialize and do not let other concurrent invocations to proceed!
    logging.info("initiation not done, do it!")
    #init blocks and return the list of corners
    #write corners to file
    cornersfile=open("/tmp/corners.txt", "w")
    cornersfile.write("\n".join('%s %s' % x for x in corners))
    cornersfile.close()
I did some testing, running the code 8 times concurrently. In the logs, I see that the first branch is entered three times and the else branch is entered 5 times.
How do I make sure that the following happens:
If any concurrent invocation finds that the initialization (the else part) is in progress, it waits; all other concurrent invocations go into a wait state as well.
If any concurrent invocation finds that the initialization has already happened (that is, the file /tmp/corners.txt is present), it loads the file.

As I understand it, several Python interpreters are running; you are not using threads.
I would solve this with file locking. There is a library: https://pypi.python.org/pypi/lockfile
Example:
from lockfile import LockFile
lock = LockFile("/some/file/or/other")
with lock:
    print lock.path, 'is locked.'
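Applied to the question, the same idea can be sketched with only the standard library. This is a minimal sketch under assumptions, not the lockfile library's API: it uses fcntl.flock, which is available on POSIX systems only, and the names load_or_init and compute_corners are hypothetical. The first invocation to obtain the lock computes and writes the corners; every other invocation blocks on the lock and then finds the finished file.

```python
import fcntl
import os
import tempfile


def load_or_init(path, compute_corners):
    """Return the corners, computing and writing them at most once across processes."""
    lock_path = path + ".lock"  # hypothetical companion lock file
    with open(lock_path, "w") as lock_fh:
        fcntl.flock(lock_fh, fcntl.LOCK_EX)  # blocks until no other process holds the lock
        try:
            if os.path.isfile(path):
                # Initialization already happened: just load the file.
                with open(path) as fh:
                    return [tuple(line.split()[:2]) for line in fh if line.strip()]
            corners = compute_corners()
            # Write to a temporary file first so readers never see a partial file.
            directory = os.path.dirname(path) or "."
            with tempfile.NamedTemporaryFile("w", dir=directory, delete=False) as tmp:
                tmp.write("\n".join("%s %s" % c for c in corners))
            os.replace(tmp.name, path)
            return corners
        finally:
            fcntl.flock(lock_fh, fcntl.LOCK_UN)
```

Note that values loaded from the file come back as strings, as in the question's own loading code.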

Related

Issues with files using Multithreading

My code is conceptually something like this
import multiprocessing.dummy
def working_with_files(test_file):
    open test_file
    ...bunch of stuff...
    create_fileA(variable)
    create_fileB_from_fileA(fileA)
    os.remove(fileA)
if __name__ == "__main__":
    files = glob("/Users/Name/Documents/TestData/*")
    pool = multiprocessing.dummy.Pool(8)
    results = pool.map(working_with_files, files)
    pool.close()
    pool.join()
From my understanding, the threads run concurrently, but inside each thread everything still happens in sequence. Since each thread runs a function, everything inside the function should still happen in order. I am, however, getting some weird errors. For example, when trying to os.remove(fileA), it says fileA doesn't exist (this only occurs sometimes); however, it should exist, since I only run that line after creating the file. These errors don't occur with a single thread.

In the comment section, @AskioFrio confirmed that different threads could create files with the same filename. So I think the issue is a race condition, which can be illustrated with the following example (steps happening sequentially):
Thread A creates a file abc.
Thread B creates a file with the same filename abc, so abc gets overwritten.
Thread A deletes abc.
Thread B tries to delete abc, which has already been deleted by thread A, and thus the error occurs.
Actually, the most notable race conditions happen in system memory, when multiple threads try to write to the same memory address (e.g., writing to the same element in an array).
To avoid race conditions, you may use a lock or semaphore to coordinate the activities of the threads.
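An alternative to locking, when the intermediate file is only scratch space, is to give every thread its own file name so the collision cannot happen. The following is a minimal sketch under that assumption; the body of working_with_files is a hypothetical stand-in (it upper-cases some text) for the real create/convert/remove steps.

```python
import os
import tempfile
from multiprocessing.dummy import Pool  # thread-based pool, as in the question


def working_with_files(text):
    # mkstemp creates a new file with a unique name, so no two threads share it.
    fd, file_a = tempfile.mkstemp(suffix=".a")
    with os.fdopen(fd, "w") as fh:
        fh.write(text)
    with open(file_a) as fh:
        result = fh.read().upper()  # stand-in for "create fileB from fileA"
    os.remove(file_a)  # safe: only this thread knows the name
    return result


def run(texts, workers=8):
    pool = Pool(workers)
    try:
        return pool.map(working_with_files, texts)
    finally:
        pool.close()
        pool.join()
```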

Multiprocessing pool for function with no arguments/iterable?

I'm running Python 2.7 on the GCE platform to do calculations. The GCE instances boot, install various packages, copy 80 Gb of data from a storage bucket and run a "workermaster.py" script with nohangup. The workermaster runs in an infinite loop which checks a task-queue bucket for tasks. When the task bucket isn't empty, it picks a random file (task) and passes work to a calculation module. If there is nothing to do, the workermaster sleeps for a number of seconds and checks the task-list again. The workermaster runs continuously until the instance is terminated (or something breaks!).
Currently this works quite well, but my problem is that my code only runs instances with a single CPU. If I want to scale up calculations I have to create many identical single-CPU instances and this means there is a large cost overhead for creating many 80 Gb disks and transferring the data to them each time, even though the calculation is only "reading" one small portion of the data for any particular calculation. I want to make everything more efficient and cost effective by making my workermaster capable of using multiple CPUs, but after reading many tutorials and other questions on SO I'm completely confused.
I thought I could just turn the important part of my workermaster code into a function, and then create a pool of processes that "call" it using the multiprocessing module. Once the workermaster loop is running on each CPU, the processes do not need to interact with each other or depend on each other in any way, they just happen to be running on the same instance. The workermaster prints out information about where it is in the calculation and I'm also confused about how it will be possible to tell the "print" statements from each process apart, but I guess that's a few steps from where I am now! My problems/confusion are that:
1) My workermaster "def" doesn't return any value because it just starts an infinite loop, whereas every web example seems to have something in the format myresult = pool.map(.....); and
2) My workermaster "def" doesn't need any arguments/inputs - it just runs, whereas the examples of multiprocessing that I have seen on SO and on the Python Docs seem to have iterables.
In case it is important, the simplified version of the workermaster code is:
# module imports are here
# filepath definitions go here
def workermaster():
    while True:
        tasklist = cloudstoragefunctions.getbucketfiles('<my-task-queue-bucket>')
        if tasklist:
            tasknumber = random.randint(2, len(tasklist))
            assignedtask = tasklist[tasknumber]
            print 'Assigned task is now: ' + assignedtask
            subprocess.call('gsutil -q cp gs://<my-task-queue-bucket>/' + assignedtask + ' "' + taskfilepath + assignedtask + '"', shell=True)
            tasktype = assignedtask.split('#')[0]
            if tasktype == 'Calculation':
                currentcalcid = assignedtask.split('#')[1]
                currentfilenumber = assignedtask.split('#')[2].replace('part', '')
                currentstartfile = assignedtask.split('#
                currentendfile = assignedtask.split('#')[4].replace('.csv', '')
                calcmodule.docalc(currentcalcid, currentfilenumber, currentstartfile, currentendfile)
            elif tasktype == 'Analysis':
                #set up and run analysis module, etc.
            print ' Operation completed!'
            os.remove(taskfilepath + assignedtask)
        else:
            print 'There are no tasks to be processed. Going to sleep...'
            time.sleep(30)
I'm trying to "call" the function multiple times using the multiprocessing module. I think I need to use the "pool" method, so I've tried this:
import multiprocessing
if __name__ == "__main__":
    p = multiprocessing.Pool()
    pool_output = p.map(workermaster, [])
My understanding from the docs is that the __name__ line is there only as a workaround for doing multiprocessing in Windows (which I am doing for development, but GCE is on Linux). The p = multiprocessing.Pool() line creates a pool of workers equal to the number of system CPUs, as no argument is specified. If the number of CPUs was 1, then I would expect the code to behave as it did before I attempted to use multiprocessing. The last line is the one that I don't understand. I thought that it was telling each of the processes in the pool that the "target" (thing to run) is workermaster. From the docs there appears to be a compulsory argument which is an iterable, but I don't really understand what this is in my case, as workermaster doesn't take any arguments. I've tried passing it an empty list, empty string, empty brackets (tuple?) and it doesn't do anything.
Please would it be possible for someone to help me out? There are lots of discussions about using multiprocessing, and this thread, "Mulitprocess Pools with different functions", and this one, "python code with mulitprocessing only spawns one process each time", seem to be close to what I am doing but still have iterables as arguments. If there is anything critical that I have left out, please advise and I will modify my post. Thank you to anyone who can help!

Pool() is useful if you want to run the same function with different arguments.
If you want to run a function only once, then use a normal Process().
If you want to run the same function 2 times, then you can manually create 2 Process() instances.
If you want to use Pool() to run a function 2 times, then pass a list with 2 arguments (even if you don't need arguments), because the length of the list tells Pool() to run it 2 times.
But if you run the function 2 times on the same folder, then it may run the same task 2 times; if you run it 5 times, then it may run the same task 5 times. I don't know whether that matters in your case.
As for Ctrl+C, I found "Catch Ctrl+C / SIGINT and exit multiprocesses gracefully in python" on Stack Overflow, but I don't know if it resolves your problem.
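The "manually create N Process() instances" suggestion can be sketched as follows. This is only an illustration written for Python 3: the workermaster body here is a hypothetical stand-in that prints once and returns instead of looping forever, and run_copies and its start_method parameter are names invented for the example. Printing the process name is one simple way to tell the output of the copies apart.

```python
import multiprocessing


def workermaster():
    # Stand-in for the real infinite loop; takes no arguments, returns nothing.
    print("%s: checking for tasks" % multiprocessing.current_process().name)


def run_copies(n, start_method=None):
    """Start n independent processes running workermaster and wait for all of them."""
    # None selects the platform default ("fork", "spawn" or "forkserver").
    ctx = multiprocessing.get_context(start_method)
    procs = [ctx.Process(target=workermaster) for _ in range(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [p.exitcode for p in procs]  # 0 means the copy exited normally
```

On platforms that do not fork (Windows in particular), the call to run_copies has to sit under an if __name__ == "__main__": guard, for example run_copies(multiprocessing.cpu_count()).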

Reducing cpu usage in python multiprocessing without sacrificing responsiveness

I have a multiprocessing program in Python, which spawns several sub-processes and manages them (restarting them if the children identify problems, etc). Each subprocess is unique and its setup depends on a configuration file. The general structure of the master program is:
def main():
    messageQueue = multiprocessing.Queue()
    errorQueue = multiprocessing.Queue()
    childProcesses = {}
    for required_children in configuration:
        childProcesses[required_children] = MultiprocessChild(errorQueue, messageQueue, *args, **kwargs)
    for child_process in childProcesses:
        childProcesses[child_process].start()
    while True:
        if local_uptime > configuration_check_timer: # This is to check if configuration file for processes has changed. E.g. check every 5 minutes
            reload_configuration()
            killChildProcessIfConfigurationChanged()
            relaunchChildProcessIfConfigurationChanged()
        # We want to relaunch error processes immediately (so while statement)
        # Errors are not always crashes. Sometimes other system parameters change that require relaunch with different, ChildProcess specific configurations.
        while not errorQueue.empty():
            _error_, _childprocess_ = errorQueue.get()
            killChildProcess(_childprocess_)
            relaunchChildProcess(_childprocess_)
            print(_error_)
        # Messages are allowed to lag if a configuration_timer is going to trigger or errorQueue gets something (so if statement)
        if not messageQueue.empty():
            print(messageQueue.get())
Is there a way to prevent the contents of the infinite while True loop from taking up 100% CPU? If I add a sleep at the end of the loop (e.g. sleep for 10s), then errors will take 10s to correct, and messages will take 10s to flush.
If, on the other hand, there was a way to time.sleep() for the duration of the configuration_check_timer while still running code as soon as messageQueue or errorQueue receives something, that would be nice.
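One possible direction, offered as a sketch under assumptions rather than as an answer from the thread: a blocking get() with a timeout sleeps until either an item arrives or the timeout expires, which is the "sleep, but wake up early" behaviour described above. The helper name wait_for_event is hypothetical, and the sketch assumes errors and messages are funnelled into one queue so that a single blocking call can cover both. It is written for Python 3, where both queue.Queue and multiprocessing.Queue raise queue.Empty on timeout.

```python
import queue
import time


def wait_for_event(events, deadline):
    """Return the next item from events, or None once deadline has passed.

    deadline is a time.monotonic() value, e.g. the next configuration check.
    """
    remaining = max(0.0, deadline - time.monotonic())
    try:
        # Blocks inside the queue implementation instead of spinning in a loop.
        return events.get(timeout=remaining)
    except queue.Empty:
        return None
```

In the main loop, a None result would mean it is time to reload the configuration, while any other result would be handled immediately as an error or a message.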

Python Multiprocessing using Process: Consuming Large Memory

I am running multiple processes from a single Python script:
Code Snippet:
while 1:
    if sqsObject.msgCount() > 0:
        ReadyMsg = sqsObject.readM2Q()
        if ReadyMsg == 0:
            continue
        fileName = ReadyMsg['fileName']
        dirName = ReadyMsg['dirName']
        uuid = ReadyMsg['uid']
        guid = ReadyMsg['guid']
        callback = ReadyMsg['callbackurl']
        # print ("Trigger Algorithm Process")
        if(countProcess < maxProcess):
            try:
                retValue = Process(target=dosomething, args=(dirName, uuid,guid,callback))
                processArray.append(retValue)
                retValue.start()
                countProcess = countProcess + 1
            except:
                print "Cannot Run Process"
        else:
            for i in range(len(processArray)):
                if (processArray[i].is_alive() == True):
                    continue
                else:
                    try:
                        #print 'Restart Process'
                        processArray[i] = Process(target=dosomething, args=(dirName,uuid,guid,callback))
                        processArray[i].start()
                    except:
                        print "Cannot Run Process"
    else: # No more request to service
        for i in range(len(processArray)):
            if (processArray[i].is_alive() == True):
                processRunning = 1
                break
            else:
                continue
        if processRunning == 0:
            countProcess = 0
        else:
            processRunning = 0
Here I am reading messages from the queue and creating a process to run the algorithm on each message. I put an upper limit of maxProcess on the number of processes. Hence, after reaching maxProcess, I want to reuse the processArray slots which are not alive by checking is_alive().
This runs fine for a small number of processes; however, for a large number of messages, say 100, memory consumption goes through the roof. I am thinking I have leak by reusing the process slots.
Not sure what is wrong in the process.
Thank you in advance for spotting an error or for wise advice.

Your code is, in a word, weird :-)
It's not an MCVE, so no one else can test it, but just looking at it, you have this (slightly simplified) structure in the inner loop:
if count < limit:
    ... start a new process, and increment count ...
else:
    do things that can potentially start even more processes
    (but never, ever, decrease count)
which seems unwise at best.
There are no invocations of a process instance's join(), anywhere. (We'll get back to the outer loop and its else case in a bit.)
Let's look more closely at the inner loop's else case code:
for i in range(len(processArray)):
    if (processArray[i].is_alive() == True):
Leaving aside the unnecessary == True test—which is a bit of a risk, since the is_alive() method does not specifically promise to return True and False, just something that works boolean-ly—consider this description from the documentation (this link goes to py2k docs but py3k is the same, and your print statements imply your code is py2k anyway):
is_alive()
Return whether the process is alive.
Roughly, a process object is alive from the moment the start() method returns until the child process terminates.
Since we can't see the code for dosomething, it's hard to say whether these things ever terminate. Probably they do (by exiting), but if they don't, or don't soon enough, we could get problems here, where we just drop the message we pulled off the queue in the outer loop.
If they do terminate, we just drop the process reference from the array, by overwriting it:
processArray[i] = Process(...)
The previous value in processArray[i] is discarded. It's not clear if you may have saved this anywhere else, but if you have not, the Process instance gets discarded, and now it is actually impossible to call its join() method.
Some Python data structures tend to clean themselves up when abandoned (e.g., open streams flush output and close as needed), but the multiprocessing code appears not to auto-join() its children. So this could be the, or a, source of the problem.
Finally, whenever we do get to the else case in the outer loop, we have the same somewhat odd search for any alive processes—which, incidentally, can be written more clearly as:
if any(p.is_alive() for p in processArray):
as long as we don't care about which particular ones are alive, and which are not—and if none report themselves as alive, we reset the count, but never do anything with the variable processArray, so that each processArray[i] still holds the identity of the Process instance. (So at least we could call join on each of these, excluding any lost by overwriting.)
Rather than building your own pool, you are probably better off using multiprocessing.Pool and its apply and apply_async methods, as in miraculixx's answer.
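If the hand-rolled approach is kept anyway, the missing join() calls could be added along these lines. This is a minimal sketch, and the helper name reap_finished is hypothetical; it relies only on the is_alive() and join() methods that multiprocessing.Process provides.

```python
def reap_finished(processes):
    """Join every finished process and return the list of those still running."""
    still_running = []
    for p in processes:
        if p.is_alive():
            still_running.append(p)
        else:
            p.join()  # returns at once for a finished child and lets it be cleaned up
    return still_running
```

Calling processArray = reap_finished(processArray) before deciding whether to start a new process would also make len(processArray) a count that can go down, unlike countProcess in the question.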

"Not sure what is wrong in the process."
It appears you are creating as many processes as there are messages, even when the maxProcess count is reached.
"I am thinking I have leak by reusing the process slots."
There is no need to manage the processes yourself. Just use a process pool:
# before your while loop starts
from multiprocessing import Pool
pool = Pool(processes=maxProcess)
while 1:
    ...
    # instead of creating a new Process
    res = pool.apply_async(dosomething,
                           args=(dirName,uuid,guid,callback))
# after the while loop has finished
# -- wait to finish
pool.close()
pool.join()
Ways to submit jobs
Note that the Pool class supports several ways to submit jobs:
apply_async - one message at a time
map_async - a chunk of messages at a time
If messages arrive fast enough it might be better to collect several of them (say 10 or 100 at a time, depending on the actual processing done) and use map to submit a "mini-batch" to the target function at a time:
...
while True:
    messages = []
    # build mini-batch of messages
    while len(messages) < batch_size:
        ... # get message
        messages.append((dirName,uuid,guid,callback))
    pool.map_async(dosomething, messages)
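The "build mini-batch" step can be factored into a small generator. This is a sketch only, and mini_batches is a hypothetical helper name; it works on any iterable of messages and yields lists of at most batch_size items.

```python
import itertools


def mini_batches(messages, batch_size):
    """Yield successive lists of up to batch_size items from messages."""
    iterator = iter(messages)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return  # input exhausted
        yield batch
```

Each batch could then be handed to the pool. Note that map_async passes every element to the target as a single argument, so with tuples like (dirName, uuid, guid, callback) the target either has to accept one tuple or, on Python 3.3 and later, starmap_async can be used to unpack them.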
To avoid memory leaks left by dosomething you can ask the Pool to restart a process after it has consumed some number of messages:
max_tasks = 5 # some sensible number
pool = Pool(maxProcess, maxtasksperchild=max_tasks)
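Putting the pieces together, a pool that caps the number of worker processes and recycles each worker after a few tasks might look like the sketch below. It is written for Python 3 under assumptions: dosomething is a hypothetical two-argument stand-in for the real algorithm, and process_messages and start_method are names invented for the example.

```python
import multiprocessing


def dosomething(dir_name, uid):
    # Hypothetical stand-in for the real per-message work.
    return "%s/%s" % (dir_name, uid)


def process_messages(messages, max_process=4, max_tasks=5, start_method=None):
    """Run dosomething on every message tuple with at most max_process workers."""
    ctx = multiprocessing.get_context(start_method)  # None selects the platform default
    # maxtasksperchild replaces a worker after max_tasks jobs, which bounds
    # whatever memory an individual worker might accumulate.
    pool = ctx.Pool(processes=max_process, maxtasksperchild=max_tasks)
    try:
        pending = [pool.apply_async(dosomething, args=m) for m in messages]
        return [r.get() for r in pending]  # results in submission order
    finally:
        pool.close()
        pool.join()
```

On platforms that do not fork, the call to process_messages needs to be placed under an if __name__ == "__main__": guard.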
Going distributed
If the memory capacity is still exceeded with this approach, consider using a distributed approach, i.e. add more machines. Using Celery, that would be pretty straightforward, coming from the above:
# tasks.py
@task
def dosomething(...):
    ... # same code as before
# driver.py
while True:
    ... # get messages as before
    res = dosomething.apply_async(args=(dirName,uuid,guid,callback))

Python multithreading without a queue working with large data sets

I am running through a csv file of about 800k rows. I need a threading solution that runs through each row and spawns 32 threads at a time into a worker. I want to do this without a queue. It looks like the current Python threading solution with a queue is eating up a lot of memory.
Basically, I want to read a csv file row and put it into a worker thread, and I only want 32 threads running at a time.
This is current script. It appears that it is reading the entire csv file into queue and doing a queue.join(). Is it correct that it is loading the entire csv into a queue then spawning the threads?
queue=Queue.Queue()
def worker():
    while True:
        task=queue.get()
        try:
            subprocess.call(['php {docRoot}/cli.php -u "api/email/ses" -r "{task}"'.format(
                docRoot=docRoot,
                task=task
            )],shell=True)
        except:
            pass
        with lock:
            stats['done']+=1
            if int(time.time())!=stats.get('now'):
                stats.update(
                    now=int(time.time()),
                    percent=(stats.get('done')/stats.get('total'))*100,
                    ps=(stats.get('done')/(time.time()-stats.get('start')))
                )
                print("\r {percent:.1f}% [{progress:24}] {persec:.3f}/s ({done}/{total}) ETA {eta:<12}".format(
                    percent=stats.get('percent'),
                    progress=('='*int((23*stats.get('percent'))/100))+'>',
                    persec=stats.get('ps'),
                    done=int(stats.get('done')),
                    total=stats.get('total'),
                    eta=snippets.duration.time(int((stats.get('total')-stats.get('done'))/stats.get('ps')))
                ),end='')
        queue.task_done()
for i in range(32):
    workers=threading.Thread(target=worker)
    workers.daemon=True
    workers.start()
try:
    with open(csvFile,'rb') as fh:
        try:
            dialect=csv.Sniffer().sniff(fh.readline(),[',',';'])
            fh.seek(0)
            reader=csv.reader(fh,dialect)
            headers=reader.next()
        except csv.Error as e:
            print("\rERROR[CSV] {error}\n".format(error=e))
        else:
            while True:
                try:
                    data=reader.next()
                except csv.Error as e:
                    print("\rERROR[CSV] - Line {line}: {error}\n".format( line=reader.line_num, error=e))
                except StopIteration:
                    break
                else:
                    stats['total']+=1
                    queue.put(urllib.urlencode(dict(zip(headers,data)+dict(campaign=row.get('Campaign')).items())))
    queue.join()

32 threads is probably overkill unless you have some humungous hardware available.
The rule of thumb for optimum number of threads or processes is: (no. of cores * 2) - 1
which comes to either 7 or 15 on most hardware.
The simplest way would be to start 7 threads, passing each thread an "offset" as a parameter,
i.e. a number from 0 to 6.
Each thread would then skip rows until it reached the "offset" number and process that row. Having processed the row, it can skip 6 rows and process the 7th; repeat until there are no more rows.
This setup works for threads and multiple processes and is very efficient in I/O on most machines as all the threads should be reading roughly the same part of the file at any given time.
I should add that this method is particularly good for python as each thread is more or less independent once started and avoids the dreaded python global lock common to other methods.
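The offset scheme described above can be sketched with itertools.islice. This is an illustration under assumptions, written for Python 3: the names stride_rows and rows_for_worker are hypothetical, and each worker is assumed to open its own handle on the file and to skip one header row.

```python
import csv
import itertools


def stride_rows(rows, offset, step):
    """Yield rows number offset, offset + step, offset + 2 * step, ... from an iterator."""
    return itertools.islice(rows, offset, None, step)


def rows_for_worker(path, offset, step):
    """Open the CSV at path and yield only the data rows that belong to one worker."""
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        next(reader, None)  # skip the header row
        for row in stride_rows(reader, offset, step):
            yield row
```

With 7 workers, worker k would iterate over rows_for_worker(csvFile, k, 7) for k from 0 to 6, so every row is handled by exactly one worker and nothing has to be queued.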

I don't understand why you want to spawn 32 threads per row. However, processing data in parallel is a fairly common, embarrassingly parallel thing to do and is easily achievable with Python's multiprocessing library.
Example:
from multiprocessing import Pool
def job(args):
    # do some work
inputs = [...] # define your inputs
Pool().map(job, inputs)
I leave it up to you to fill in the blanks to meet your specific requirements.
See: https://bitbucket.org/ccaih/ccav/src/tip/bin/ for many examples of this pattern.

Other answers have explained how to use Pool without having to manage queues (it manages them for you) and that you do not want to set the number of processes to 32, but to your CPU count - 1. I would add two things. First, you may want to look at the pandas package, which can easily import your csv file into Python. The second is that the examples of using Pool in the other answers only pass it a function that takes a single argument. Unfortunately, you can only pass Pool a single object with all the inputs for your function, which makes it difficult to use functions that take multiple arguments. Here is code that allows you to call a previously defined function with multiple arguments using pool:
import multiprocessing
from multiprocessing import Pool
def multiplyxy(x,y):
    return x*y
def funkytuple(t):
    """
    Breaks a tuple into a function to be called and a tuple
    of arguments for that function. Changes that new tuple into
    a series of arguments and passes those arguments to the
    function.
    """
    f = t[0]
    t = t[1]
    return f(*t)
def processparallel(func, arglist):
    """
    Takes a function and a list of arguments for that function
    and processes in parallel.
    """
    parallelarglist = []
    for entry in arglist:
        parallelarglist.append((func, tuple(entry)))
    # keep at least one worker, even on a single-CPU machine
    cpu_count = max(1, multiprocessing.cpu_count() - 1)
    pool = Pool(processes = cpu_count)
    database = pool.map(funkytuple, parallelarglist)
    pool.close()
    return database
#Necessary on Windows
if __name__ == '__main__':
    x = [23, 23, 42, 3254, 32]
    y = [324, 234, 12, 425, 13]
    i = 0
    arglist = []
    while i < len(x):
        arglist.append([x[i],y[i]])
        i += 1
    database = processparallel(multiplyxy, arglist)
    print(database)

Your question is pretty unclear. Have you tried initializing your Queue to have a maximum size of, say, 64?
myq = Queue.Queue(maxsize=64)
Then a producer (one or more) trying to .put() new items on myq will block until consumers reduce the queue size to less than 64. This will correspondingly limit the amount of memory consumed by the queue. By default, queues are unbounded: if the producer(s) add items faster than consumers take them off, the queue can grow to consume all the RAM you have.
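A minimal sketch of the bounded-queue idea follows. It is written for Python 3, where the module is named queue rather than Queue, and the names run_bounded and handle are hypothetical. The sketch assumes that None never occurs as a real item (it serves as the stop signal) and that handle does not raise.

```python
import queue
import threading


def run_bounded(items, handle, num_workers=4, maxsize=64):
    """Process items in worker threads via a queue holding at most maxsize entries."""
    tasks = queue.Queue(maxsize=maxsize)
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            if item is None:  # stop signal: no more work for this thread
                return
            value = handle(item)
            with results_lock:
                results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)  # blocks while the queue is full, so memory stays bounded
    for _ in threads:
        tasks.put(None)  # one stop signal per worker
    for t in threads:
        t.join()
    return results
```

Because put() blocks whenever maxsize items are waiting, the reader can never get more than maxsize rows ahead of the workers, however large the input is.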
EDIT
"This is current script. It appears that it is reading the entire csv file into queue and doing a queue.join(). Is it correct that it is loading the entire csv into a queue then spawning the threads?"
The indentation is messed up in your post, so have to guess some, but:
The code obviously starts 32 threads before it opens the CSV file.
You didn't show the code that creates the queue. As already explained above, if it's a Queue.Queue, by default it's unbounded, and can grow to any size if your main loop puts items on it faster than your threads remove items from it. Since you haven't said anything about what worker() does (or shown its code), we don't have enough information to guess whether that's the case. But that memory use is out of hand suggests that's the case.
And, as also explained, you can stop that easily by specifying a maximum size when you create the queue.
To get better answers, supply better info ;-)
ANOTHER EDIT
Well, the indentation is still messed up in spots, but it's better. Have you tried any suggestions? Looks like your worker threads each spawn a new process, so they'll take very much longer than it takes just to read another line from the csv file. So it's indeed very likely that you put items on the queue far faster than they're taken off. So, for the umpteenth time ;-), TRY initializing the queue with (say) maxsize=64. Then reveal what happens.
BTW, the bare except: clause in worker() is a Really Bad Idea. If anything goes wrong, you'll never know. If you have to ignore every possible exception (including even KeyboardInterrupt and SystemExit), at least log the exception info.
And note what @JamesAnderson said: unless you have extraordinary hardware resources, trying to run 32 processes at a time is almost certainly slower than running a number of processes that's no more than twice the number of available cores. Then again, that also depends a lot on what your PHP program does. If, for example, the PHP program uses disk I/O heavily, any multiprocessing may be slower than none.
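As for the bare except: clause, one way to keep a worker alive without hiding failures is sketched below. The helper name safe_call is hypothetical. Catching Exception rather than using a bare except lets KeyboardInterrupt and SystemExit propagate, and logging.exception records the full traceback.

```python
import logging

log = logging.getLogger("worker")


def safe_call(task, *args):
    """Run task(*args); log any ordinary exception and report success as a bool."""
    try:
        task(*args)
        return True
    except Exception:
        # Unlike a bare except, this does not swallow KeyboardInterrupt or SystemExit.
        log.exception("task %r failed with arguments %r", task, args)
        return False
```

In worker(), the subprocess.call(...) line could be wrapped this way, so that a failing row is reported in the log while the thread carries on with the next one.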
