Can list.append(foo) be not thread-safe? - python

I've used multi-threading a lot, while appending to the same list from different threads. Everything worked fine.
However, I'm getting a problem with list appending when the threads are like 70 threads or more. Appending in the last thread gets stuck for like 5 mins (the processor is not occubied at this time, maybe 10 %. So, it's not a hardware problem I would say). Then appending occurs successfully.
At this link, it says that list appending is thread-safe.
My question is: Can list appending ever become not thread safe?
Don't ask for a code or so. I just need a simple yes or no to my question. And if yes, kindly provide suggestions to fix that.

list appending in python is thread safe.
For more details: what kinds of global value mutation are thread safe
You last thread gets stuck maybe due to other reason, for example: Memory allocating..
My first step to fix the stuck is use strace to trace the syscall.
And you can use gdb to print all threads' call stack too. Here is a wiki page: https://wiki.python.org/moin/DebuggingWithGdb

Related

Concurrent access to one file by several unrelated processes on macOS

I need to get several processes to communicate with each others on a macOS system. These processes will be spawned several times per day at different times, and I cannot predict when they will be up at the same time (if ever). These programs are in python or swift.
How can I safely allow these programs to all write to the same file?
I have explored a few different options:
I thought of using sqlite3, however I couldn't find an answer in the documentation on whether it was safe to write concurrently across processes. This question is not very definitive, old, and I would ideally like to get a more authoritative answer.
I thought of using multiprocesing as it supports locks. However, as far as I could see in the documentation, you need a meta-process that spawns the children and stays up for the duration of the longest child process. I am am fine having a meta-spawner process, but it feels wasteful to have a meta-process basically staying up all day long, just to resolve conflicting access ?
Along the lines of this, I thought of having a process that stays up all day long, and receive messages from all other processes, and is the sole responsible for writing to file. It feels a little wasteful, how worried should I be about the resource cost of having a program up all day, and doing little? Are the only thing to worry about memory footprint and CPU usage (as shown in activity monitor), or could there be other significant costs, e.g. context switching?
I have come across flock on linux, that seems to be a locking mechanism to access files, provided by the OS. This seems like a good solution, but this does not seem to exist on macOS?
Any idea to solve this requirement in a robust manner (so that I don't have to debug every other day - I now concurrency can be a pain), is most welcome!
While you are in control of the source code of all such processes, you could use flock. It will put the advisory lock on file, so the other writer will be blocked only in case he is also access the file the same way. This is OK for you, if only your processes will ever need to write to the shared file.
I've tested flock on BigSur, it is still implemented and works fine.
You can also do it in any other common manner: create temporary .lock file in the known location(this is what git does), and remove it after current writer is done with the main file; use semaphores; etc

Does Python ever stop an infinite "for" loop?

Does python automatically terminate or generate an error for an infinite "for" loop?
for example, would
list = [1]
for number in list:
print (number + 1)
list.append(number + 1)
cause the program to self-terminate or error?
I have tested it a bit, going as far as around 10,000.
No, the program will run forever. It may crash because your call stack is to big (if you use recursion somewhere) or you used to much memory though. Your example will eventually crash.
while True:
pass
Will run "forever" with no issues however.
That program will end, but it will be a very long time until it does. Eventually, you'll run the system out of memory for your list and the program will crash with an error.
It won't unless you specify a range or a limit. The way you currently have it it'll keep going until you get a memory error.
Also, don't use list as a variable since it's a reserved function by python.

file.read() multiprocessing and the GIL

I've read that certain Python functions implemented in C, which I assume includes file.read(), can release the GIL while they're working and then get it back on completion and by doing so make use of multiple cores if they're available.
I'm using multiprocess to parallelize some code and currently I've got three processes, the parent, one child that reads data from a file, and one child that generates a checksum from the data passed to it by the first child process.
Now if I'm understanding this right, it seems that creating a new process to read the file as I'm currently doing is uneccessary and I should just call it in the main process. The question is am I understanding this right and will I get better performance with the read kept in the main process or in a separate one?
So given my function to read and pipe the data to be processed:
def read(file_path, pipe_out):
with open(file_path, 'rb') as file_:
while True:
block = file_.read(block_size)
if not block:
break
pipe_out.send(block)
pipe_out.close()
I reckon that this will definitely make use of multiple cores, but also introduces some overhead:
multiprocess.Process(target=read, args).start()
But now I'm wondering if just doing this will also use multiple cores, minus the overhead:
read(*args)
Any insights anybody has as to which one would be faster and for what reason would be much appreciated!
Okay, as came out by the comments, the actual question is:
Does (C)Python create threads on its own, and if so, how can I make use of that?
Short answer: No.
But, the reason why these C-Functions are nevertheless interesting for Python programmers is the following. By default, no two snippets of python code running in the same interpreter can execute in parallel, this is due to the evil called the Global Interpreter Lock, aka the GIL. The GIL is held whenever the interpreter is executing Python code, which implies the above statement, that no two pieces of python code can run in parallel in the same interpreter.
Nevertheless, you can still make use of multithreading in python, namely when you're doing a lot of I/O or make a lot of use of external libraries like numpy, scipy, lxml and so on, which all know about the issue and release the GIL whenever they can (i.e. whenever they do not need to interact with the python interpreter).
I hope that cleared up the issue a bit.
I think this is the main part of your question:
The question is am I understanding this right and will I get better
performance with the read kept in the main process or in a separate
one?
I assume your goal is to read and process the file as fast as possible. File reading is in any case I/O bound and not CPU bound. You cannot process data faster than you are able to read it. So file I/O clearly limits the performance of your software. You cannot increase the read data rate by using concurrent threads/processes for file reading. Also 'low level' CPython is not doing this. As long as you read the file in one process or thread (even in case of CPython with its GIL a thread is fine), you will get as much data per time as you can get from the storage device. It is also fine if you do the file reading in the main thread as long as there are no other blocking calls that would actually slow down the file reading.

Pythonistas, please help convert this to utilize Python Threading concepts

Update : For anyone wondering what I went with at the end -
I divided the result-set into 4 and ran 4 instances of the same program with one argument each indicating what set to process. It did the trick for me. I also consider PP module. Though it worked, it prefer the same program. Please pitch in if this is a horrible implementation! Thanks..
Following is what my program does. Nothing memory intensive. It is serial processing and boring. Could you help me convert this to more efficient and exciting process? Say, I process 1000 records this way and with 4 threads, I can get it to run in 25% time!
I read articles on how python threading can be inefficient if done wrong. Even python creator says the same. So I am scared and while I am reading more about them, want to see if bright folks on here can steer me in the right direction. Muchos gracias!
def startProcessing(sernum, name):
'''
Bunch of statements depending on result,
will write to database (one update statement)
Try Catch blocks which upon failing,
will call this function until the request succeeds.
'''
for record in result:
startProc = startProcessing(str(record[0]), str(record[1]))
Python threads can't run at the same time due to the Global Interpreter Lock; you want new processes instead. Look at the multiprocessing module.
(I was instructed to post this as an answer =p.)

Python - Working around memory leaks

I have a Python program that runs a series of experiments, with no data intended to be stored from one test to another. My code contains a memory leak which I am completely unable to find (I've look at the other threads on memory leaks). Due to time constraints, I have had to give up on finding the leak, but if I were able to isolate each experiment, the program would probably run long enough to produce the results I need.
Would running each test in a separate thread help?
Are there any other methods of isolating the effects of a leak?
Detail on the specific situation
My code has two parts: an experiment runner and the actual experiment code.
Although no globals are shared between the code for running all the experiments and the code used by each experiment, some classes/functions are necessarily shared.
The experiment runner isn't just a simple for loop that can be easily put into a shell script. It first decides on the tests which need to be run given the configuration parameters, then runs the tests then outputs the data in a particular way.
I tried manually calling the garbage collector in case the issue was simply that garbage collection wasn't being run, but this did not work
Update
Gnibbler's answer has actually allowed me to find out that my ClosenessCalculation objects which store all of the data used during each calculation are not being killed off. I then used that to manually delete some links which seems to have fixed the memory issues.
You can use something like this to help track down memory leaks
>>> from collections import defaultdict
>>> from gc import get_objects
>>> before = defaultdict(int)
>>> after = defaultdict(int)
>>> for i in get_objects():
... before[type(i)] += 1
...
now suppose the tests leaks some memory
>>> leaked_things = [[x] for x in range(10)]
>>> for i in get_objects():
... after[type(i)] += 1
...
>>> print [(k, after[k] - before[k]) for k in after if after[k] - before[k]]
[(<type 'list'>, 11)]
11 because we have leaked one list containing 10 more lists
Threads would not help. If you must give up on finding the leak, then the only solution to contain its effect is running a new process once in a while (e.g., when a test has left overall memory consumption too high for your liking -- you can determine VM size easily by reading /proc/self/status in Linux, and other similar approaches on other OS's).
Make sure the overall script takes an optional parameter to tell it what test number (or other test identification) to start from, so that when one instance of the script decides it's taking up too much memory, it can tell its successor where to restart from.
Or, more solidly, make sure that as each test is completed its identification is appended to some file with a well-known name. When the program starts it begins by reading that file and thus knows what tests have already been run. This architecture is more solid because it also covers the case where the program crashes during a test; of course, to fully automate recovery from such crashes, you'll want a separate watchdog program and process to be in charge of starting a fresh instance of the test program when it determines the previous one has crashed (it could use subprocess for the purpose -- it also needs a way to tell when the sequence is finished, e.g. a normal exit from the test program could mean that while any crash or exit with a status != 0 signify the need to start a new fresh instance).
If these architectures appeal but you need further help implementing them, just comment to this answer and I'll be happy to supply example code -- I don't want to do it "preemptively" in case there are as-yet-unexpressed issues that make the architectures unsuitable for you. (It might also help to know what platforms you need to run on).
I had the same problem with a third party C library which was leaking. The most clean work-around that I could think of was to fork and wait. The advantage of it is that you don't even have to create a separate process after each run. You can define the size of your batch.
Here's a general solution (if you ever find the leak, the only change you need to make is to change run() to call run_single_process() instead of run_forked() and you'll be done):
import os,sys
batchSize = 20
class Runner(object):
def __init__(self,dataFeedGenerator,dataProcessor):
self._dataFeed = dataFeedGenerator
self._caller = dataProcessor
def run(self):
self.run_forked()
def run_forked(self):
dataFeed = self._dataFeed
dataSubFeed = []
for i,dataMorsel in enumerate(dataFeed,1):
if i % batchSize > 0:
dataSubFeed.append(dataMorsel)
else:
self._dataFeed = dataSubFeed
self.fork()
dataSubFeed = []
if self._child_pid is 0:
self.run_single_process()
self.endBatch()
def run_single_process(self)
for dataMorsel in self._dataFeed:
self._caller(dataMorsel)
def fork(self):
self._child_pid = os.fork()
def endBatch(self):
if self._child_pid is not 0:
os.waitpid(self._child_pid, 0)
else:
sys.exit() # exit from the child when done
This isolates the memory leak to the child process. And it will never leak more times than the value of the batchSize variable.
I would simply refactor the experiments into individual functions (if not like that already) then accept an experiment number from the command line which calls the single experiment function.
The just bodgy up a shell script as follows:
#!/bin/bash
for expnum in 1 2 3 4 5 6 7 8 9 10 11 ; do
python youProgram ${expnum} otherParams
done
That way, you can leave most of your code as-is and this will clear out any memory leaks you think you have in between each experiment.
Of course, the best solution is always to find and fix the root cause of a problem but, as you've already stated, that's not an option for you.
Although it's hard to imagine a memory leak in Python, I'll take your word on that one - you may want to at least consider the possibility that you're mistaken there, however. Consider raising that in a separate question, something that we can work on at low priority (as opposed to this quick-fix version).
Update: Making community wiki since the question has changed somewhat from the original. I'd delete the answer but for the fact I still think it's useful - you could do the same to your experiment runner as I proposed the bash script for, you just need to ensure that the experiments are separate processes so that memory leaks dont occur (if the memory leaks are in the runner, you're going to have to do root cause analysis and fix the bug properly).

Categories

Resources