Avoid increased runtime when opening threads in consecutive runs - python

I'm doing my final thesis, and my topic is the creation of software that will run and control an on-satellite experiment.
For that reason, I had to implement the reading of multiple sensors while the experiment is running. To do that, I wrote the code so that it creates a new thread for each sensor (multiprocessing might not work because I don't yet know which system the software will run on, and therefore I can't say whether multiple processors will be available), and these threads run as daemons the whole time the software does its thing. It works well, but now I need to test the whole thing, and this is where it gets problematic:
To properly test each and every route the software could take, I have multiple variables that need to be set, so there will be a lot of test runs (I calculated around 17,000, but I could be wrong). While the first few test runs go over quickly, each run takes longer and longer. I have fiddled around with my code a little bit, and it turns out that without threading, each test takes about the same time. Unfortunately, I do not know why, and my knowledge of the matter is very limited. The code concerning the threading is as follows:
This sets up the creation of each thread (sensor_list will be populated with multiple sensors in non-test conditions)
sensor_list = [<a single sensor>]
for sensor in sensor_list:
    thread = threading.Thread(
        target=self.store_sensor_data,
        args=[sensor, query_frequency],
        daemon=True,
        name=f"Thread_{sensor}",
    )
    self.threads.append(thread)
    thread.start()
The function which actually deals with getting and writing the sensor data, self.store_sensor_data, looks like this:
def store_sensor_data(self, sensor, frequency):
    """Get the current reading and result from 'sensor' and store them.
    sensor (Sensor) - the sensor whose data shall be stored
    frequency (int) - the frequency (in 1/s) at which data shall be stored
    """
    value_id = 0
    while not self.HALT:
        value_id += 1
        sensor_reading = sensor.get_reading()
        sensor_result = sensor.get_result()
        try:
            # if there already is a list for that sensor, append the data to it
            self.experiment_report.sensor_data_raw[str(sensor)].append(
                (value_id, sensor_reading)
            )
        except KeyError:
            # if there is no list, create one containing the current sensor value
            self.experiment_report.sensor_data_raw[str(sensor)] = [
                (value_id, sensor_reading)
            ]
        # repeat the same for the 'result'
        try:
            self.experiment_report.sensor_data[str(sensor)].append(
                (value_id, sensor_result)
            )
        except KeyError:
            self.experiment_report.sensor_data[str(sensor)] = [
                (value_id, sensor_result)
            ]
        time.sleep(1 / frequency)
After the experiment is done, I stop the threads by calling:
def interrupt_sensor_data_recording(self):
    """Interrupt the storing of sensor data by ending all daemon threads.
    threads (list) - a list of currently running threads
    """
    if len(self.threads) > 0:
        self.HALT = True
        for thread in self.threads:
            if thread.is_alive():
                logger.debug(f"Stopping thread '{thread.getName()}'")
                thread.join()
            else:
                thread.join()
                logger.debug(f"Thread '{thread.getName()}' was already stopped")
Now I am unsure whether the way I stop the daemon threads is appropriate, and this might be the source of my problems. But there might also be some implication that I don't know about yet, and in either case it would be nice if someone with more knowledge than me could help me out here.
Thanks in advance!

Related

Python multiprocessing gradually increases memory until it runs out

I have a python program with multiple modules. They go like this:
Job class that is the entry point and manages the overall flow of the program
Task class that is the base class for the tasks to be run on given data. Many SubTask classes, created specifically for different types of calculations on different columns of data, are derived from the Task class. Think of 10 columns in the data, each having its own Task to do some processing; e.g. the 'price' column can be used by a CurrencyConverterTask to return local currency values, and so on.
Many other modules like a connector for getting data, a utils module, etc., which I don't think are relevant for this question.
The general flow of the program: get data from the db continuously -> process the data -> write back the updated data to the db.
I decided to do it with multiprocessing because the tasks are relatively simple: most of them do some basic arithmetic or logic operations, and running it in one process takes a long time; in particular, getting data from a large db and processing it in sequence is very slow.
So the multiprocessing (mp) code looks something like this (I cannot expose the entire file, so I'm writing a simplified version; the parts not included are not relevant here. I've tested by commenting them out, so this is an accurate representation of the actual code):
class Job():
    def __init__(self):
        block_size = 100  # process 100 rows at a time
        some_query = "SELECT * IF A > B"  # some query to filter data from db

    def data_getter(self):
        # continuously get data from the db and put it into a queue in blocks
        cursor = Connector.get_data(some_query)
        block = []
        for item in cursor:
            block.append(item)
            if len(block) == block_size:
                data_queue.put(block)
                block = []
        data_queue.put(None)  # this will indicate to the worker processes when to stop

    def monitor(self):
        # continuously monitor the system stats
        timer = Timer()
        while True:
            if timer.time_taken >= 60:  # log some stats every 60 seconds
                print(utils.system_stats())
                timer.reset()

    def task_runner(self):
        while True:
            # get data from the queue
            # if there's no data, break out of loop
            data = data_queue.get()
            if data is None:
                break
            # run tasks one by one
            for task in tasks:
                task.do_something(data)

    def run(self):
        # queue to put data for processing
        data_queue = mp.Queue()
        # start a process for reading data from db
        dg = mp.Process(target=self.data_getter).start()
        # start a process for monitoring system stats
        mon = mp.Process(target=self.monitor).start()
        # get a list of tasks to run
        tasks = [t for t in taskmodule.get_subtasks()]
        workers = []
        # start 4 processes to do the actual processing
        for _ in range(4):
            worker = mp.Process(target=self.task_runner)
            worker.start()
            workers.append(worker)
        for w in workers:
            w.join()
        mon.terminate()  # terminate the monitor process
        dg.terminate()  # end the data getting process

if __name__ == "__main__":
    job = Job()
    job.run()
The whole program is run like: python3 runjob.py
Expected behaviour: a continuous stream of data goes into the data_queue, and each worker process gets the data and processes it until there's no more data from the cursor, at which point the workers finish and the entire program finishes.
This is working as expected, but what is not expected is that the system memory usage keeps creeping up continuously until the system crashes. The data I'm getting here is not copied anywhere (at least not intentionally). I expect the memory usage to be steady throughout the program. The length of the data_queue rarely exceeds 1 or 2, since the processes are fast enough to get the data when available, so it's not the queue holding too much data.
My guess is that all the processes initiated here are long-running ones and that has something to do with this. I can print the pid, and if I follow the PID in the top command, the data_getter and monitor processes don't exceed more than 2% of memory usage; the 4 worker processes also don't use a lot of memory, and neither does the main process the whole thing runs in. But there is an unaccounted-for process that takes up 20%+ of the RAM, and it bugs me so much that I can't figure out what it is.

Reducing cpu usage in python multiprocessing without sacrificing responsiveness

I have a multiprocessing program in Python which spawns several sub-processes and manages them (restarting them if the children identify problems, etc.). Each subprocess is unique and its setup depends on a configuration file. The general structure of the master program is:
def main():
    messageQueue = multiprocessing.Queue()
    errorQueue = multiprocessing.Queue()
    childProcesses = {}

    for required_children in configuration:
        childProcesses[required_children] = MultiprocessChild(errorQueue, messageQueue, *args, **kwargs)
    for child_process in childProcesses:
        childProcesses[child_process].start()

    while True:
        if local_uptime > configuration_check_timer:  # This is to check if configuration file for processes has changed. E.g. check every 5 minutes
            reload_configuration()
            killChildProcessIfConfigurationChanged()
            relaunchChildProcessIfConfigurationChanged()

        # We want to relaunch error processes immediately (so while statement)
        # Errors are not always crashes. Sometimes other system parameters change that require relaunch with different, ChildProcess specific configurations.
        while not errorQueue.empty():
            _error_, _childprocess_ = errorQueue.get()
            killChildProcess(_childprocess_)
            relaunchChildProcess(_childprocess_)
            print(_error_)

        # Messages are allowed to lag if a configuration_timer is going to trigger or errorQueue gets something (so if statement)
        if not messageQueue.empty():
            print(messageQueue.get())
Is there a way to prevent the contents of the infinite while True loop from taking up 100% of the CPU? If I add a sleep at the end of the loop (e.g. sleep for 10 s), then errors will take 10 s to correct, and messages will take 10 s to flush.
If, on the other hand, there were a way to time.sleep() for the duration of the configuration_check_timer while still running code when messageQueue or errorQueue get something put into them, that would be nice.
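For what it's worth, one way to get that kind of behaviour is to block on the error queue with a timeout instead of sleeping unconditionally; a minimal, self-contained Python 3 sketch of the idea (the prints stand in for the real kill/relaunch logic):
import multiprocessing
import queue  # for the Empty exception raised on timeout

def main_loop(errorQueue, messageQueue):
    while True:
        try:
            # Block for up to 1 second instead of spinning; wakes up
            # immediately if an error is put on the queue.
            error, child = errorQueue.get(timeout=1)
            print("relaunching after error:", error, child)
        except queue.Empty:
            pass  # no error within the last second
        # Drain any pending messages without blocking.
        while not messageQueue.empty():
            print(messageQueue.get())

if __name__ == "__main__":
    eq, mq = multiprocessing.Queue(), multiprocessing.Queue()
    mq.put("hello")
    eq.put(("some error", "child-1"))
    main_loop(eq, mq)  # runs forever; Ctrl+C to stop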

Separate computation from socket work in Python

I'm serializing column data and then sending it over a socket connection.
Something like:
import array, struct, socket

## Socket setup
s = socket.create_connection((ip, addr))

## Data container setup
ordered_col_list = ('col1', 'col2')
columns = dict.fromkeys(ordered_col_list)

for i in range(num_of_chunks):
    ## Binarize data
    columns['col1'] = array.array('i', range(10000))
    columns['col2'] = array.array('f', [float(num) for num in range(10000)])
    .
    .
    .
    ## Send away
    chunk = b''.join(columns[col_name] for col_name in ordered_col_list)
    s.sendall(chunk)
    s.recv(1000)  # get confirmation
I wish to separate the computation from the sending, put them on separate threads or processes, so I can keep doing computations while data is sent away.
I've put the binarizing part into a generator function, sent the generator to a separate thread, and had that thread yield binary chunks via a queue.
I collected the data from the main thread and sent it away. Something like:
import array, struct, socket
from time import sleep

try:
    import thread
    from Queue import Queue
except:
    import _thread as thread
    from queue import Queue

## Socket and queue setup
s = socket.create_connection((ip, addr))
chunk_queue = Queue()

def binarize(num_of_chunks):
    ''' Generator function that yields chunks of binary data. In reality it wouldn't be the same data'''
    ordered_col_list = ('col1', 'col2')
    columns = dict.fromkeys(ordered_col_list)
    for i in range(num_of_chunks):
        columns['col1'] = array.array('i', range(10000)).tostring()
        columns['col2'] = array.array('f', [float(num) for num in range(10000)]).tostring()
        .
        .
        yield b''.join((columns[col_name] for col_name in ordered_col_list))

def chunk_yielder(queue):
    ''' Generate binary chunks and put them on a queue. To be used from a thread '''
    while True:
        try:
            data_gen = queue.get_nowait()
        except:
            sleep(0.1)
            continue
        else:
            for chunk in data_gen:
                queue.put(chunk)

## Setup thread and data generator
thread.start_new_thread(chunk_yielder, (chunk_queue,))
num_of_chunks = 100
data_gen = binarize(num_of_chunks)
queue.put(data_gen)

## Get data back and send away
while True:
    try:
        binary_chunk = queue.get_nowait()
    except:
        sleep(0.1)
        continue
    else:
        socket.sendall(binary_chunk)
        socket.recv(1000)  # Get confirmation
However, I did not see any performance improvement - it did not work faster.
I don't understand threads/processes too well, and my question is whether it is possible (at all, and in Python) to gain from this type of separation, and what would be a good way to go about it, either with threads or processes (or any other way - async etc.).
EDIT:
As far as I've come to understand:
Multiprocessing requires serializing any sent data, so I'm double-sending every computed chunk.
Sending via socket.send() should release the GIL.
Therefore I think (please correct me if I am mistaken) that a threading solution is the right way. However, I'm not sure how to do it correctly.
I know cython can release the GIL from threads, but since one of them just does socket.send/recv, my understanding is that it shouldn't be necessary.
You have two options for running things in parallel in Python: either use the multiprocessing (docs) library, or write the parallel code in cython and release the GIL. The latter is significantly more work and less applicable generally speaking.
Python threads are limited by the Global Interpreter Lock (GIL); I won't go into detail here, as you will find more than enough information online about it. In short, the GIL, as the name suggests, is a global lock within the CPython interpreter that ensures multiple threads do not modify objects that are within the confines of said interpreter simultaneously. This is why, for instance, cython programs can run code in parallel: they can exist outside the GIL.
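For illustration, a minimal self-contained timing sketch of this effect (exact numbers will vary with your machine): the same pure-Python CPU-bound function run on 4 threads is roughly as slow as running it serially, while 4 processes can use 4 cores.
import time
from multiprocessing import Pool
from threading import Thread

def crunch(n):
    # pure-Python CPU-bound work; holds the GIL the whole time
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(label, fn):
    start = time.time()
    fn()
    print(f"{label}: {time.time() - start:.2f}s")

def with_threads():
    threads = [Thread(target=crunch, args=(5_000_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def with_processes():
    with Pool(4) as pool:
        pool.map(crunch, [5_000_000] * 4)

if __name__ == "__main__":
    timed("4 threads  ", with_threads)    # roughly serial time: threads share the GIL
    timed("4 processes", with_processes)  # roughly 4x faster on 4+ cores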
As to your code, one problem is that you're running both the number crunching (binarize) and the socket.send inside the GIL; this will run them strictly serially. The queue is also connected very strangely, and there is a NameError, but let's leave those aside.
With the caveats already pointed out by Jeremy Friesner in mind, I suggest you re-structure the code in the following manner: you have two processes (not threads), one for binarising the data and the other for sending it. In addition to those, there is also the parent process that started both children, and a queue connecting child 1 to child 2.
Subprocess-1 does the number crunching and puts the crunched data on a queue
Subprocess-2 consumes data from a queue and does socket.send
In code, the setup would look something like:
from multiprocessing import Process, Queue
work_queue = Queue()
p1 = Process(target=binarize, args=(100, work_queue))
p2 = Process(target=send_data, args=(ip, port, work_queue))
p1.start()
p2.start()
p1.join()
p2.join()
binarize can remain as it is in your code, with the exception that instead of a yield at the end, you add elements into the queue
def binarize(num_of_chunks, q):
    ''' Produce chunks of binary data and put them on a queue. In reality it wouldn't be the same data '''
    ordered_col_list = ('col1', 'col2')
    columns = dict.fromkeys(ordered_col_list)
    for i in range(num_of_chunks):
        columns['col1'] = array.array('i', range(10000)).tostring()
        columns['col2'] = array.array('f', [float(num) for num in range(10000)]).tostring()
        data = b''.join((columns[col_name] for col_name in ordered_col_list))
        q.put(data)
send_data should just be the while loop from the bottom of your code, with the connection open/close functionality
def send_data(ip, addr, q):
    s = socket.create_connection((ip, addr))
    while True:
        try:
            binary_chunk = q.get(False)
        except:
            sleep(0.1)
            continue
        else:
            s.sendall(binary_chunk)
            s.recv(1000)  # Get confirmation
    # maybe remember to close the socket before killing the process
Now you have two (three actually, if you count the parent) processes that are processing data independently. You can force the two processes to synchronise their operations by setting the maxsize of the queue to a single element. The operation of these two separate processes is also easy to monitor from the process manager on your computer: top (Linux), Activity Monitor (OS X); I don't remember what it's called under Windows.
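As a minimal illustration of that lock-step behaviour (toy producer/consumer functions, not the ones above):
from multiprocessing import Process, Queue
import time

def producer(q):
    for i in range(5):
        q.put(i)                  # blocks while the previous item is still in the queue
        print("produced", i)

def consumer(q):
    for _ in range(5):
        item = q.get()
        time.sleep(0.5)           # stand-in for the slow socket send
        print("consumed", item)

if __name__ == "__main__":
    q = Queue(maxsize=1)          # at most one chunk in flight at a time
    p1 = Process(target=producer, args=(q,))
    p2 = Process(target=consumer, args=(q,))
    p1.start(); p2.start()
    p1.join(); p2.join()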
Finally, Python 3 comes with the option of using co-routines, which are neither processes nor threads, but something else entirely. Co-routines are pretty cool from a CS point of view, but a bit of a head-scratcher at first. There are plenty of resources to learn from, though, like this post on Medium and this talk by David Beazley.
Even more generally, you might want to look into the producer/consumer pattern, if you are not already familiar with it.
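As a taste, a minimal coroutine-based version of the same producer/consumer idea (asyncio, purely illustrative; it is not a replacement for the multiprocessing setup above):
import asyncio

async def producer(q):
    for i in range(5):
        await q.put(f"chunk-{i}")   # suspends here whenever the queue is full
    await q.put(None)               # sentinel: no more data

async def consumer(q):
    while True:
        chunk = await q.get()
        if chunk is None:
            break
        print("sending", chunk)     # stand-in for the socket I/O

async def main():
    q = asyncio.Queue(maxsize=1)
    await asyncio.gather(producer(q), consumer(q))

asyncio.run(main())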
If you are trying to use concurrency to improve performance in CPython, I would strongly recommend using the multiprocessing library instead of multithreading. This is because of the GIL (Global Interpreter Lock), which can have a huge impact on execution speed (in some cases, it may cause your code to run slower than the single-threaded version). Also, if you would like to learn more about this topic, I recommend reading this presentation by David Beazley. Multiprocessing bypasses this problem by spawning a new Python interpreter instance for each process, thus allowing you to take full advantage of a multi-core architecture.

Python Multiprocessing using Process: Consuming Large Memory

I am running multiple processes from single python code:
Code Snippet:
while 1:
    if sqsObject.msgCount() > 0:
        ReadyMsg = sqsObject.readM2Q()
        if ReadyMsg == 0:
            continue
        fileName = ReadyMsg['fileName']
        dirName = ReadyMsg['dirName']
        uuid = ReadyMsg['uid']
        guid = ReadyMsg['guid']
        callback = ReadyMsg['callbackurl']
        # print ("Trigger Algorithm Process")
        if(countProcess < maxProcess):
            try:
                retValue = Process(target=dosomething, args=(dirName, uuid,guid,callback))
                processArray.append(retValue)
                retValue.start()
                countProcess = countProcess + 1
            except:
                print "Cannot Run Process"
        else:
            for i in range(len(processArray)):
                if (processArray[i].is_alive() == True):
                    continue
                else:
                    try:
                        #print 'Restart Process'
                        processArray[i] = Process(target=dosomething, args=(dirName,uuid,guid,callback))
                        processArray[i].start()
                    except:
                        print "Cannot Run Process"
    else:  # No more request to service
        for i in range(len(processArray)):
            if (processArray[i].is_alive() == True):
                processRunning = 1
                break
            else:
                continue
        if processRunning == 0:
            countProcess = 0
        else:
            processRunning = 0
Here I am reading the messages from the queue and creating a process to run the algorithm on that message. I am putting an upper limit of maxProcess, and hence after reaching maxProcess, I want to reuse the processArray slots which are not alive, by checking is_alive().
This runs fine for a smaller number of processes; however, for a large number of messages, say 100, memory consumption goes through the roof. I am thinking I have a leak by reusing the process slots.
Not sure what is wrong in the process.
Thank you in advance for spotting an error or for wise advice.
Your code is, in a word, weird :-)
It's not an MCVE, so no one else can test it, but just looking at it, you have this (slightly simplified) structure in the inner loop:
if count < limit:
    ... start a new process, and increment count ...
else:
    do things that can potentially start even more processes
    (but never, ever, decrease count)
which seems unwise at best.
There are no invocations of a process instance's join(), anywhere. (We'll get back to the outer loop and its else case in a bit.)
Let's look more closely at the inner loop's else case code:
for i in range(len(processArray)):
    if (processArray[i].is_alive() == True):
Leaving aside the unnecessary == True test—which is a bit of a risk, since the is_alive() method does not specifically promise to return True and False, just something that works boolean-ly—consider this description from the documentation (this link goes to py2k docs but py3k is the same, and your print statements imply your code is py2k anyway):
is_alive()
Return whether the process is alive.
Roughly, a process object is alive from the moment the start() method returns until the child process terminates.
Since we can't see the code for dosomething, it's hard to say whether these things ever terminate. Probably they do (by exiting), but if they don't, or don't soon enough, we could get problems here, where we just drop the message we pulled off the queue in the outer loop.
If they do terminate, we just drop the process reference from the array, by overwriting it:
processArray[i] = Process(...)
The previous value in processArray[i] is discarded. It's not clear if you may have saved this anywhere else, but if you have not, the Process instance gets discarded, and now it is actually impossible to call its join() method.
Some Python data structures tend to clean themselves up when abandoned (e.g., open streams flush output and close as needed), but the multiprocess code appears not to auto-join() its children. So this could be the, or a, source of the problem.
Finally, whenever we do get to the else case in the outer loop, we have the same somewhat odd search for any alive processes—which, incidentally, can be written more clearly as:
if any(p.is_alive() for p in processArray):
as long as we don't care about which particular ones are alive, and which are not—and if none report themselves as alive, we reset the count, but never do anything with the variable processArray, so that each processArray[i] still holds the identity of the Process instance. (So at least we could call join on each of these, excluding any lost by overwriting.)
Rather than building your own pool yourself, you are probably better off using multiprocessing.Pool and its apply and apply_async methods, as in miraculixx's answer.
Not sure what is wrong in the process.
It appears you are creating as many processes as there are messages, even when the maxProcess count is reached.
I am thinking I have leak by reusing the process slots.
There is no need to manage the processes yourself. Just use a process pool:
# before your while loop starts
from multiprocessing import Pool
pool = Pool(processes=max_process)

while 1:
    ...
    # instead of creating a new Process
    res = pool.apply_async(dosomething,
                           args=(dirName,uuid,guid,callback))

# after the while loop has finished
# -- wait to finish
pool.close()
pool.join()
Ways to submit jobs
Note that the Pool class supports several ways to submit jobs:
apply_async - one message at a time
map_async - a chunk of messages at a time
If messages arrive fast enough it might be better to collect several of them (say 10 or 100 at a time, depending on the actual processing done) and use map to submit a "mini-batch" to the target function at a time:
...
while True:
    messages = []
    # build mini-batch of messages
    while len(messages) < batch_size:
        ...  # get message
        messages.append((dirName,uuid,guid,callback))
    pool.map_async(dosomething, messages)
To avoid memory leaks left by dosomething you can ask the Pool to restart a process after it has consumed some number of messages:
max_tasks = 5 # some sensible number
Pool(max_processes, maxtasksperchild=max_tasks)
Going distributed
If with this approach the memory capacity is still exceeded, consider using a distributed approach, i.e. add more machines. Using Celery that would be pretty straightforward; coming from the above:
# tasks.py
@task
def dosomething(...):
    ...  # same code as before

# driver.py
while True:
    ...  # get messages as before
    res = dosomething.apply_async(args=(dirName,uuid,guid,callback))

Python multithreading without a queue working with large data sets

I am running through a csv file of about 800k rows. I need a threading solution that runs through each row and spawns 32 threads at a time into a worker. I want to do this without a queue. It looks like the current Python threading solution with a queue is eating up a lot of memory.
Basically I want to read a csv file row by row and hand each row to a worker thread, with only 32 threads running at a time.
This is the current script. It appears that it is reading the entire csv file into the queue and doing a queue.join(). Is it correct that it is loading the entire csv into a queue and then spawning the threads?
queue=Queue.Queue()

def worker():
    while True:
        task=queue.get()
        try:
            subprocess.call(['php {docRoot}/cli.php -u "api/email/ses" -r "{task}"'.format(
                docRoot=docRoot,
                task=task
            )],shell=True)
        except:
            pass
        with lock:
            stats['done']+=1
            if int(time.time())!=stats.get('now'):
                stats.update(
                    now=int(time.time()),
                    percent=(stats.get('done')/stats.get('total'))*100,
                    ps=(stats.get('done')/(time.time()-stats.get('start')))
                )
                print("\r {percent:.1f}% [{progress:24}] {persec:.3f}/s ({done}/{total}) ETA {eta:<12}".format(
                    percent=stats.get('percent'),
                    progress=('='*int((23*stats.get('percent'))/100))+'>',
                    persec=stats.get('ps'),
                    done=int(stats.get('done')),
                    total=stats.get('total'),
                    eta=snippets.duration.time(int((stats.get('total')-stats.get('done'))/stats.get('ps')))
                ),end='')
        queue.task_done()

for i in range(32):
    workers=threading.Thread(target=worker)
    workers.daemon=True
    workers.start()

try:
    with open(csvFile,'rb') as fh:
        try:
            dialect=csv.Sniffer().sniff(fh.readline(),[',',';'])
            fh.seek(0)
            reader=csv.reader(fh,dialect)
            headers=reader.next()
        except csv.Error as e:
            print("\rERROR[CSV] {error}\n".format(error=e))
        else:
            while True:
                try:
                    data=reader.next()
                except csv.Error as e:
                    print("\rERROR[CSV] - Line {line}: {error}\n".format( line=reader.line_num, error=e))
                except StopIteration:
                    break
                else:
                    stats['total']+=1
                    queue.put(urllib.urlencode(dict(zip(headers,data)+dict(campaign=row.get('Campaign')).items())))
queue.join()
32 threads is probably overkill unless you have some humungous hardware available.
The rule of thumb for optimum number of threads or processes is: (no. of cores * 2) - 1
which comes to either 7 or 15 on most hardware.
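In code, that rule of thumb could be derived at runtime like this (a tiny illustrative snippet):
import os

# rule of thumb: (number of cores * 2) - 1
num_workers = max(1, (os.cpu_count() or 1) * 2 - 1)
print(num_workers)  # e.g. 7 on a quad-core machine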
The simplest way would be to start 7 threads, passing each thread an "offset" as a parameter,
i.e. a number from 0 to 6.
Each thread would then skip rows until it reached the "offset" number and process that row. Having processed the row, it can skip 6 rows and process the 7th -- repeat until there are no more rows (see the sketch at the end of this answer).
This setup works for threads and multiple processes and is very efficient in I/O on most machines as all the threads should be reading roughly the same part of the file at any given time.
I should add that this method is particularly good for python as each thread is more or less independent once started and avoids the dreaded python global lock common to other methods.
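A minimal sketch of that offset/stride approach (the file name and process_row are placeholders for the real per-row work):
import csv
import threading

NUM_THREADS = 7  # (cores * 2) - 1 on a quad-core machine

def process_row(row):
    # stand-in for the real per-row work (e.g. the subprocess call above)
    pass

def worker(offset, csv_path):
    # each thread opens its own handle and only handles the rows where
    # row_number % NUM_THREADS == offset
    with open(csv_path, newline='') as fh:
        reader = csv.reader(fh)
        next(reader)                    # skip the header row
        for i, row in enumerate(reader):
            if i % NUM_THREADS == offset:
                process_row(row)

threads = [threading.Thread(target=worker, args=(offset, "data.csv"))
           for offset in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()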
I don't understand why you want to spawn 32 threads per row. However, data processing in parallel is a fairly common, embarrassingly parallel thing to do and easily achievable with Python's multiprocessing library.
Example:
from multiprocessing import Pool

def job(args):
    # do some work

inputs = [...]  # define your inputs
Pool().map(job, inputs)
I leave it up to you to fill in the blanks to meet your specific requirements.
See: https://bitbucket.org/ccaih/ccav/src/tip/bin/ for many examples of this pattern.
Other answers have explained how to use Pool without having to manage queues (it manages them for you) and that you do not want to set the number of processes to 32, but to your CPU count - 1. I would add two things. First, you may want to look at the pandas package, which can easily import your csv file into Python. The second is that the examples of using Pool in the other answers only pass it a function that takes a single argument. Unfortunately, you can only pass Pool a single object with all the inputs for your function, which makes it difficult to use functions that take multiple arguments. Here is code that allows you to call a previously defined function with multiple arguments using pool:
import multiprocessing
from multiprocessing import Pool

def multiplyxy(x,y):
    return x*y

def funkytuple(t):
    """
    Breaks a tuple into a function to be called and a tuple
    of arguments for that function. Changes that new tuple into
    a series of arguments and passes those arguments to the
    function.
    """
    f = t[0]
    t = t[1]
    return f(*t)

def processparallel(func, arglist):
    """
    Takes a function and a list of arguments for that function
    and processes them in parallel.
    """
    parallelarglist = []
    for entry in arglist:
        parallelarglist.append((func, tuple(entry)))
    cpu_count = int(multiprocessing.cpu_count() - 1)
    pool = Pool(processes = cpu_count)
    database = pool.map(funkytuple, parallelarglist)
    pool.close()
    return database

#Necessary on Windows
if __name__ == '__main__':
    x = [23, 23, 42, 3254, 32]
    y = [324, 234, 12, 425, 13]
    i = 0
    arglist = []
    while i < len(x):
        arglist.append([x[i],y[i]])
        i += 1
    database = processparallel(multiplyxy, arglist)
    print(database)
Your question is pretty unclear. Have you tried initializing your Queue to have a maximum size of, say, 64?
myq = Queue.Queue(maxsize=64)
Then a producer (one or more) trying to .put() new items on myq will block until consumers reduce the queue size to less than 64. This will correspondingly limit the amount of memory consumed by the queue. By default, queues are unbounded: if the producer(s) add items faster than consumers take them off, the queue can grow to consume all the RAM you have.
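A tiny self-contained illustration of that back-pressure (Python 3 module names, toy numbers):
import queue
import threading
import time

myq = queue.Queue(maxsize=64)

def consumer():
    while True:
        item = myq.get()
        if item is None:          # sentinel: the producer is done
            break
        time.sleep(0.01)          # simulate slow per-item work

consumer_thread = threading.Thread(target=consumer)
consumer_thread.start()

for i in range(1000):
    myq.put(i)                    # blocks whenever 64 items are already waiting
myq.put(None)
consumer_thread.join()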
EDIT
This is the current script. It appears that it is reading the entire csv file into the queue and doing a queue.join(). Is it correct that it is loading the entire csv into a queue and then spawning the threads?
The indentation is messed up in your post, so have to guess some, but:
The code obviously starts 32 threads before it opens the CSV file.
You didn't show the code that creates the queue. As already explained above, if it's a Queue.Queue, by default it's unbounded, and can grow to any size if your main loop puts items on it faster than your threads remove them. Since you haven't said anything about what worker() does (or shown its code), we don't have enough information to guess whether that's the case. But the fact that memory use is out of hand suggests that's the case.
And, as also explained, you can stop that easily by specifying a maximum size when you create the queue.
To get better answers, supply better info ;-)
ANOTHER EDIT
Well, the indentation is still messed up in spots, but it's better. Have you tried any suggestions? Looks like your worker threads each spawn a new process, so they'll take very much longer than it takes just to read another line from the csv file. So it's indeed very likely that you put items on the queue far faster than they're taken off. So, for the umpteenth time ;-), TRY initializing the queue with (say) maxsize=64. Then reveal what happens.
BTW, the bare except: clause in worker() is a Really Bad Idea. If anything goes wrong, you'll never know. If you have to ignore every possible exception (including even KeyboardInterrupt and SystemExit), at least log the exception info.
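For example, a hypothetical worker body that at least records failures instead of swallowing them:
import logging
import subprocess

def run_php_task(cmd):
    try:
        subprocess.call(cmd, shell=True)
    except Exception:
        # don't silently swallow failures: log the full traceback
        logging.exception("worker task failed: %r", cmd)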
And note what @JamesAnderson said: unless you have extraordinary hardware resources, trying to run 32 processes at a time is almost certainly slower than running a number of processes that's no more than twice the number of available cores. Then again, that depends a lot on what your PHP program does. If, for example, the PHP program uses disk I/O heavily, any multiprocessing may be slower than none.
