OS starts killing processes when a multi-threaded Python process runs

This is the strangest thing!
I have a multi-threaded client application written in Python. I'm using threading to concurrently download and process pages. I would use the cURL multi-handle, except that the bottleneck is definitely the processor (not the bandwidth) in this application, so it is more efficient to use a thread pool.
I have a 64-bit i7 rocking 16GB RAM. Beefy. I launch 80 threads while listening to Pandora and trolling Stack Overflow and BAM! The parent process sometimes ends with the message
Killed
Other times a single page (which is its own process in Chrome) will die. Other times the whole browser crashes.
If you want to see a bit of code here is the gist of it:
Here is the parent process:
def start( ):
    while True:
        for url in to_download:
            queue.put( ( url, uri_id ) )
        to_download = [ ]
        if queue.qsize( ) < BATCH_SIZE:
            to_download = get_more_urls( BATCH_SIZE )
        if threading.activeCount( ) < NUM_THREADS:
            for thread in threads:
                if not thread.isAlive( ):
                    print "Respawning..."
                    thread.join( )
                    threads.remove( thread )
                    t = ClientThread( queue )
                    t.start( )
                    threads.append( t )
        time.sleep( 0.5 )
And here is the gist of the ClientThread:
class ClientThread( threading.Thread ):

    def __init__( self, queue ):
        threading.Thread.__init__( self )
        self.queue = queue

    def run( self ):
        while True:
            try:
                self.url, self.url_id = self.queue.get( )
            except:
                raise SystemExit

            html = StringIO.StringIO( )
            curl = pycurl.Curl( )
            curl.setopt( pycurl.URL, self.url )
            curl.setopt( pycurl.NOSIGNAL, True )
            curl.setopt( pycurl.WRITEFUNCTION, html.write )

            try:
                curl.perform( )
            except pycurl.error, error:
                errno, errstr = error
                print errstr

            curl.close( )
EDIT: Oh, right...forgot to ask the question...should be obvious: Why do my processes get killed? Is it happening at the OS level? Kernel level? Is this due to a limitation on the number of open TCP connections I can have? Is it a limit on the number of threads I can run at once? The output of cat /proc/sys/kernel/threads-max is 257841. So...I don't think it's that....
I think I've got it...OK...I have no swap space at all on my drive. Is there a way to create some swap space now? I'm running Fedora 16. There WAS swap...then I enabled all my RAM and it disappeared magically. Tailing /var/log/messages I found this error:
Mar 26 19:54:03 gazelle kernel: [700140.851877] [15961] 500 15961 12455 7292 1 0 0 postgres
Mar 26 19:54:03 gazelle kernel: [700140.851880] Out of memory: Kill process 15258 (chrome) score 5 or sacrifice child
Mar 26 19:54:03 gazelle kernel: [700140.851883] Killed process 15258 (chrome) total-vm:214744kB, anon-rss:70660kB, file-rss:18956kB
Mar 26 19:54:05 gazelle dbus: [system] Activating service name='org.fedoraproject.Setroubleshootd' (using servicehelper)

You've triggered the kernel's Out Of Memory (OOM) handler; it selects which processes to kill in a complicated fashion that tries hard to kill as few processes as possible to make the most impact. Chrome apparently makes the most inviting process to kill under the criteria the kernel uses.
You can see a summary of the criteria in the proc(5) manpage under the /proc/[pid]/oom_score file:
/proc/[pid]/oom_score (since Linux 2.6.11)
    This file displays the current score that the kernel gives to this process for the purpose of selecting a process for the OOM-killer. A higher score means that the process is more likely to be selected by the OOM-killer. The basis for this score is the amount of memory used by the process, with increases (+) or decreases (-) for factors including:
    * whether the process creates a lot of children using fork(2) (+);
    * whether the process has been running a long time, or has used a lot of CPU time (-);
    * whether the process has a low nice value (i.e., > 0) (+);
    * whether the process is privileged (-); and
    * whether the process is making direct hardware access (-).
    The oom_score also reflects the bit-shift adjustment specified by the oom_adj setting for the process.
You can adjust the oom_adj file (or, on newer kernels, oom_score_adj) for your Python program if you want it to be the one that is killed.
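For example, a minimal sketch (my own illustration, not from the original answer) that makes the running Python process volunteer itself to the OOM-killer; it assumes a kernel new enough to expose /proc/self/oom_score_adj, and the helper name is just illustrative:

# Minimal sketch: make this process the OOM-killer's preferred victim.
# Assumes /proc/self/oom_score_adj exists; older kernels expose
# /proc/self/oom_adj (range -17..15) instead.
def prefer_me_for_oom_kill(adj=1000):
    # oom_score_adj ranges from -1000 (never kill) to +1000 (kill first)
    with open("/proc/self/oom_score_adj", "w") as f:
        f.write(str(adj))

prefer_me_for_oom_kill()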
Probably the better approach is adding more swap to your system to try to push off the time when the OOM-killer is invoked. Granted, having more swap doesn't necessarily mean that your system will never run out of memory -- and you might not care for the way it handles if there is a lot of swap traffic -- but it can at least get you past tight memory problems.
If you've already allocated all the space available for swap partitions, you can add swap files. Because they go through the filesystem, there is more overhead for swap files than swap partitions, but you can add them after the drive is partitioned, making it an easy short-term solution. You use the dd(1) command to allocate the file (do not use seek to make a sparse file) and then use mkswap(8) to format the file for swap use, then use swapon(8) to turn on that specific file. (I think you can even add swap files to fstab(5) to make them automatically available at next reboot, too, but I've never tried and don't know the syntax.)

You are doing a
raise SystemExit
which actually exits the Python interpreter and not the thread you are running in.

Related

Can Python version message be suppressed for child processes (Windows)?

I am using multiprocessing to calculate a large mass of data; i.e., I periodically spawn a process so that the total number of processes is equal to the number of CPUs on my machine.
I periodically print out the progress of the entire calculation... but this is inconveniently interspersed with Python's welcome messages from each child!
To be clear, this is a Windows-specific problem due to how multiprocessing is handled.
E.g.
> python -q my_script.py
Python Version: 3.7.7 on Windows
Then many subsequent duplicates of the same version message print; one for each child process.
How can I suppress these?
I understand that if you run Python on the command line with a -q flag, it suppresses the welcome message; though I don't know how to translate that into my script.
EDIT:
I tried to include the interpreter flag -q like so:
multiprocessing.set_executable(sys.executable + ' -q')
Yet to no avail. I receive a FileNotFoundError which tells me I cannot pass options this way due to how they check arguments.
Anyways, here is the relevant section of code (It's an entire function):
def _parallelize(self, buffer, func, cpus):
    ## Number of Parallel Processes ##
    cpus_max = mp.cpu_count()
    cpus = min(cpus_max, cpus) if cpus else int(0.75*cpus_max)
    ## Total Processes to-do ##
    N = ceil(self.SampleLength / DATA_MAX)  # Number of Child Processes
    print("N: ", N)
    q = mp.Queue()  # Child Process results Queue
    ## Initialize each CPU w/ a Process ##
    for p in range(min(cpus, N)):
        mp.Process(target=func, args=(p, q)).start()
    ## Collect Validation & Start Remaining Processes ##
    for p in tqdm(range(N)):
        n, data = q.get()               # Collects a Result
        i = n * DATA_MAX                # Shifts to Proper Interval
        buffer[i:i + len(data)] = data  # Writes to open HDF5 file
        if p < N - cpus:                # Starts a new Process
            mp.Process(target=func, args=(p + cpus, q)).start()
SECOND EDIT:
I should probably mention that I'm doing everything within an anaconda environment.
The message is printed on interactive startup.
A spawned process does inherit some flags from the parent process.
But looking at the code in multiprocessing it does not seem possible to change these parameters from within the program.
So the easiest way to get rid of the messages should be to add the -q option to the original python invocation that starts your program.
I have confirmed that the -q flag is inherited.
So that should suppress the message for the original process and the children that it spawns.
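As a quick check (my own test sketch, not from the original post), sys.flags.quiet reports whether the interpreter was started with -q, so you can print it in the parent and in a spawned child; run the script with python -q and both should report 1:

import multiprocessing as mp
import sys

def report_quiet_flag():
    # sys.flags.quiet is 1 when the interpreter was started with -q
    print("child quiet flag:", sys.flags.quiet)

if __name__ == "__main__":
    print("parent quiet flag:", sys.flags.quiet)
    mp.set_start_method("spawn")  # the Windows default, made explicit here
    p = mp.Process(target=report_quiet_flag)
    p.start()
    p.join()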
Edit:
If you look at the implementation of set_executable, you will see that you cannot add or change arguments that way. :-(
Edit2:
You wrote:
I'm doing everything within an anaconda environment.
Do you mean a virtual environment, or some kind of fancy IDE like Spyder?
If you ever have a Python problem, first try reproducing it in plain CPython, running from the command line. IDE's and fancy environments like anaconda sometimes do weird things when running Python.

Reducing cpu usage in python multiprocessing without sacrificing responsiveness

I have a multiprocessing program in Python which spawns several sub-processes and manages them (restarting them if the children identify problems, etc.). Each subprocess is unique and its setup depends on a configuration file. The general structure of the master program is:
def main():
    messageQueue = multiprocessing.Queue()
    errorQueue = multiprocessing.Queue()
    childProcesses = {}
    for required_children in configuration:
        childProcesses[required_children] = MultiprocessChild(errorQueue, messageQueue, *args, **kwargs)
    for child_process in childProcesses:
        childProcesses[child_process].start()
    while True:
        if local_uptime > configuration_check_timer:  # This is to check if the configuration file for processes has changed. E.g. check every 5 minutes
            reload_configuration()
            killChildProcessIfConfigurationChanged()
            relaunchChildProcessIfConfigurationChanged()
        # We want to relaunch error processes immediately (so while statement)
        # Errors are not always crashes. Sometimes other system parameters change that require relaunch with different, ChildProcess specific configurations.
        while not errorQueue.empty():
            _error_, _childprocess_ = errorQueue.get()
            killChildProcess(_childprocess_)
            relaunchChildProcess(_childprocess_)
            print(_error_)
        # Messages are allowed to lag if a configuration_timer is going to trigger or errorQueue gets something (so if statement)
        if not messageQueue.empty():
            print(messageQueue.get())
Is there a way to prevent the contents of the infinite while True loop from taking up 100% of the CPU? If I add a sleep at the end of the loop (e.g. sleep for 10 s), then errors will take up to 10 s to be corrected and messages will take up to 10 s to flush.
If, on the other hand, there were a way to sleep for the duration of the configuration_check_timer while still running code as soon as messageQueue or errorQueue receive something, that would be nice.
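For what it's worth, one common way to get that behaviour with just the standard library (a sketch under my own names, not the poster's code) is to block on the error queue with a timeout instead of sleeping: the loop wakes immediately when an error arrives, but still reaches the configuration check on schedule.

import multiprocessing
import queue  # only for the queue.Empty exception raised by multiprocessing queues
import time

messageQueue = multiprocessing.Queue()
errorQueue = multiprocessing.Queue()
CONFIG_CHECK_INTERVAL = 300  # e.g. check the configuration every 5 minutes

next_config_check = time.monotonic() + CONFIG_CHECK_INTERVAL
while True:
    timeout = max(0.0, next_config_check - time.monotonic())
    try:
        # Sleeps here instead of time.sleep(); wakes immediately on an error.
        _error_, _childprocess_ = errorQueue.get(timeout=timeout)
        print(_error_)  # killChildProcess / relaunchChildProcess would go here
    except queue.Empty:
        pass
    while not messageQueue.empty():
        print(messageQueue.get())
    if time.monotonic() >= next_config_check:
        # reload_configuration() and the kill/relaunch checks would go here
        next_config_check = time.monotonic() + CONFIG_CHECK_INTERVAL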

Python fork: 'Cannot allocate memory' if process consumes more than 50% avail. memory

I encountered a memory allocation problem when forking processes in Python. I know the issue has already been discussed in some other posts here; however, I couldn't find a good solution in any of them.
Here is a sample script illustrating the problem:
import os
import psutil
import subprocess

pid = os.getpid()
this_proc = psutil.Process(pid)
MAX_MEM = int(psutil.virtual_memory().free*1E-9)  # in GB

def consume_memory(size):
    """ Size in GB """
    memory_consumer = []
    while get_mem_usage() < size:
        memory_consumer.append(" "*1000000)  # Adding ~1MB
    return(memory_consumer)

def get_mem_usage():
    return(this_proc.memory_info()[0]/2.**30)

def get_free_mem():
    return(psutil.virtual_memory().free/2.**30)

if __name__ == "__main__":
    for i in range(1, MAX_MEM):
        consumer = consume_memory(i)
        mem_usage = get_mem_usage()
        print("\n## Memory usage %d/%d GB (%2d%%) ##" % (int(mem_usage),
              MAX_MEM, int(mem_usage*100/MAX_MEM)))
        try:
            subprocess.call(['echo', '[OK] Fork worked.'])
        except OSError as e:
            print("[ERROR] Fork failed. Got OSError.")
            print(e)
        del consumer
The script was tested with Python 2.7 and 3.6 on Arch Linux and uses psutil to keep track of memory usage. It gradually increases the memory usage of the Python process and tries to fork a process using subprocess.call(). Forking fails if more than 50% of the available memory is consumed by the parent process.
## Memory usage 1/19 GB ( 5%) ##
[OK] Fork worked.
## Memory usage 2/19 GB (10%) ##
[OK] Fork worked.
## Memory usage 3/19 GB (15%) ##
[OK] Fork worked.
[...]
## Memory usage 9/19 GB (47%) ##
[OK] Fork worked.
## Memory usage 10/19 GB (52%) ##
[ERROR] Fork failed. Got OSError.
[Errno 12] Cannot allocate memory
## Memory usage 11/19 GB (57%) ##
[ERROR] Fork failed. Got OSError.
[Errno 12] Cannot allocate memory
## Memory usage 12/19 GB (63%) ##
[ERROR] Fork failed. Got OSError.
[Errno 12] Cannot allocate memory
## Memory usage 13/19 GB (68%) ##
[ERROR] Fork failed. Got OSError.
[Errno 12] Cannot allocate memory
[...]
Note that I had no swap activated when running this test.
There seem to be two options to solve this problem:
Using swap of at least twice the size of physical memory.
Changing the overcommit_memory setting: echo 1 > /proc/sys/vm/overcommit_memory
I tried the latter on my desktop machine and the above script finished without errors.
However, on the computing cluster I'm working on I can't use either of these options.
Also, forking the required processes in advance, before consuming the memory, is unfortunately not an option.
Does anybody have another suggestion on how to solve this problem?
Thank you!
Best
Leonhard
The problem you are facing is not really Python related and also not something you could really do much to change with Python alone. Starting a forking process (executor) up front as suggested by mbrig in the comments really seems to be the best and cleanest option for this scenario.
Python or not, you are dealing with how Linux (or similar systems) create new processes. Your parent process first calls fork(2), which creates a new child process as a copy of itself. It does not actually copy itself elsewhere at that time (it uses copy-on-write); nonetheless, the kernel checks whether sufficient space is available and, if not, fails, setting errno to 12: ENOMEM -> the OSError exception you're seeing.
Yes, allowing the virtual memory subsystem to overcommit memory can stop this error from popping up... and if you exec a new (smaller) program in the child, it does not have to cause any immediate failures. But it sounds like possibly kicking the problem further down the road.
Growing memory (adding swap) pushes the limit, and as long as twice your running process still fits into available memory, the fork could succeed. With the follow-up exec, the swap would not even need to be utilized.
There seems to be one more option, but it looks... dirty. There is another syscall, vfork(), which creates a new process that initially shares memory with its parent, whose execution is suspended at that point. The newly created child process can only assign the variable returned by vfork, call _exit, or call exec. As such, it is not exposed through any Python interface, and if you try (I did) loading it directly into Python using ctypes, it segfaults (I presume because Python would still do something other than just those three actions after vfork and before I could exec something else in the child).
That said, you can delegate the whole vfork and exec to a shared object you load in. As a very rough proof of concept, I did just that:
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

char run(char * const arg[]) {
    pid_t child;
    int wstatus;
    char ret_val = -1;

    child = vfork();
    if (child < 0) {
        printf("run: Failed to fork: %i\n", errno);
    } else if (child == 0) {
        printf("arg: %s\n", arg[0]);
        execv(arg[0], arg);
        _exit(-1);
    } else {
        child = waitpid(child, &wstatus, 0);
        if (WIFEXITED(wstatus))
            ret_val = WEXITSTATUS(wstatus);
    }
    return ret_val;
}
And I've modified your sample code in the following way (bulk of the change is in and around replacement of subprocess.call):
import ctypes
import os
import psutil

pid = os.getpid()
this_proc = psutil.Process(pid)
MAX_MEM = int(psutil.virtual_memory().free*1E-9)  # in GB

def consume_memory(size):
    """ Size in GB """
    memory_consumer = []
    while get_mem_usage() < size:
        memory_consumer.append(" "*1000000)  # Adding ~1MB
    return(memory_consumer)

def get_mem_usage():
    return(this_proc.memory_info()[0]/2.**30)

def get_free_mem():
    return(psutil.virtual_memory().free/2.**30)

if __name__ == "__main__":
    forker = ctypes.CDLL("forker.so", use_errno=True)
    for i in range(1, MAX_MEM):
        consumer = consume_memory(i)
        mem_usage = get_mem_usage()
        print("\n## Memory usage %d/%d GB (%2d%%) ##" % (int(mem_usage),
              MAX_MEM, int(mem_usage*100/MAX_MEM)))
        try:
            cmd = [b"/bin/echo", b"[OK] Fork worked."]
            c_cmd = (ctypes.c_char_p * (len(cmd) + 1))()
            c_cmd[:] = cmd + [None]
            ret = forker.run(c_cmd)
            errno = ctypes.get_errno()
            if errno:
                raise OSError(errno, os.strerror(errno))
        except OSError as e:
            print("[ERROR] Fork failed. Got OSError.")
            print(e)
        del consumer
With that, I could still fork at 3/4 of available memory reported filled.
In theory it could all be written "properly" and wrapped nicely to integrate well with Python code, but while it seems to be one additional option, I'd still go back to the executor process.
I've only briefly scanned through the concurrent.futures.process module, but once it spawns a worker process, it does not seem to clobber it before it's done, so perhaps abusing an existing ProcessPoolExecutor would be a quick and cheap option. I've added these close to the top of the script (main part):
import concurrent.futures

def nop():
    pass

executor = concurrent.futures.ProcessPoolExecutor(max_workers=1)
executor.submit(nop)  # start a worker process in the pool
And then submit the subprocess.call to it:
proc = executor.submit(subprocess.call, ['echo', '[OK] Fork worked.'])
proc.result() # can also collect the return value

Python3: RuntimeError: can't start new thread [duplicate]

I have a site that runs with the following configuration:
Django + mod-wsgi + apache
In one of the user requests, I send another HTTP request to another service, which I do with Python's httplib library.
But sometimes this service takes too long to answer, and the timeout for httplib doesn't work. So I create a thread, send the request to the service in that thread, and join it after 20 sec (20 sec is the timeout of the request). This is how it works:
class HttpGetTimeOut(threading.Thread):
    def __init__(self, **kwargs):
        self.config = kwargs
        self.resp_data = None
        self.exception = None
        super(HttpGetTimeOut, self).__init__()

    def run(self):
        h = httplib.HTTPSConnection(self.config['server'])
        h.connect()
        sended_data = self.config['sended_data']
        h.putrequest("POST", self.config['path'])
        h.putheader("Content-Length", str(len(sended_data)))
        h.putheader("Content-Type", 'text/xml; charset="utf-8"')
        if 'base_auth' in self.config:
            base64string = base64.encodestring('%s:%s' % self.config['base_auth'])[:-1]
            h.putheader("Authorization", "Basic %s" % base64string)
        h.endheaders()

        try:
            h.send(sended_data)
            self.resp_data = h.getresponse()
        except httplib.HTTPException, e:
            self.exception = e
        except Exception, e:
            self.exception = e
something like this...
And I use it with this function:
getting = HttpGetTimeOut(**req_config)
getting.start()
getting.join(COOPERATION_TIMEOUT)
if getting.isAlive():  # maybe need some block
    getting._Thread__stop()
    raise ValueError('Timeout')
else:
    if getting.resp_data:
        r = getting.resp_data
    else:
        if getting.exception:
            raise ValueError('Request Exception')
        else:
            raise ValueError('Undefined exception')
And it all works fine, but sometimes I start catching this exception:
error: can't start new thread
at the line of starting new thread:
getting.start()
and the next and the final line of traceback is
File "/usr/lib/python2.5/threading.py", line 440, in start
_start_new_thread(self.__bootstrap, ())
So the question is: what's happening?
Thanks to all, and sorry for my poor English. :)
The "can't start new thread" error almost certainly due to the fact that you have already have too many threads running within your python process, and due to a resource limit of some kind the request to create a new thread is refused.
You should probably look at the number of threads you're creating; the maximum number you will be able to create will be determined by your environment, but it should be in the order of hundreds at least.
It would probably be a good idea to re-think your architecture here; seeing as this is running asynchronously anyhow, perhaps you could use a pool of threads to fetch resources from another site instead of always starting up a thread for every request.
Another improvement to consider is your use of Thread.join and Thread.stop; this would probably be better accomplished by providing a timeout value to the constructor of HTTPSConnection.
You are starting more threads than can be handled by your system. There is a limit to the number of threads that can be active for one process.
Your application is starting threads faster than the threads are running to completion. If you need to start many threads, you need to do it in a more controlled manner; I would suggest using a thread pool.
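For illustration, a minimal thread-pool sketch with the modern standard library (concurrent.futures and urllib.request rather than the original httplib code; the URL list is a placeholder) that caps the number of live threads and uses a per-request timeout:

from concurrent.futures import ThreadPoolExecutor
import urllib.request

URLS = ["https://example.com/"] * 20  # placeholder workload

def fetch(url):
    # per-request timeout, so no watchdog thread is needed
    with urllib.request.urlopen(url, timeout=20) as resp:
        return resp.status

with ThreadPoolExecutor(max_workers=8) as pool:  # never more than 8 worker threads
    for status in pool.map(fetch, URLS):
        print(status)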
I was running into a similar situation, but my process needed a lot of threads running to take care of a lot of connections.
I counted the number of threads with the command:
ps -fLu user | wc -l
It displayed 4098.
I switched to the user and looked at the system limits:
sudo -u myuser -s /bin/bash
ulimit -u
Got 4096 as response.
So, I edited /etc/security/limits.d/30-myuser.conf and added the lines:
myuser hard nproc 16384
myuser soft nproc 16384
Restarted the service and now it's running with 7017 threads.
P.S. I have a 32-core server and I'm handling 18k simultaneous connections with this configuration.
I think the best way in your case is to set a socket timeout instead of spawning a thread:
h = httplib.HTTPSConnection(self.config['server'],
timeout=self.config['timeout'])
Also you can set global default timeout with socket.setdefaulttimeout() function.
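For example, a two-line sketch of the global default (it applies to any socket created afterwards without an explicit timeout):

import socket
socket.setdefaulttimeout(20)  # seconds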
Update: See the answers to the question Is there any way to kill a Thread in Python? (there are several quite informative ones) to understand why: Thread.__stop() doesn't terminate the thread, but rather sets an internal flag so that it's considered already stopped.
I completely rewrote the code from httplib to pycurl.
c = pycurl.Curl()
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.setopt(pycurl.CONNECTTIMEOUT, CONNECTION_TIMEOUT)
c.setopt(pycurl.TIMEOUT, COOPERATION_TIMEOUT)
c.setopt(pycurl.NOSIGNAL, 1)
c.setopt(pycurl.POST, 1)
c.setopt(pycurl.SSL_VERIFYHOST, 0)
c.setopt(pycurl.SSL_VERIFYPEER, 0)
c.setopt(pycurl.URL, "https://"+server+path)
c.setopt(pycurl.POSTFIELDS,sended_data)
b = StringIO.StringIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.perform()
something like that.
And I'm testing it now. Thanks to all of you for the help.
If you are trying to set a timeout, why don't you use urllib2?
I'm running a Python script on my machine only to copy and convert some files from one format to another, and I want to maximize the number of running threads to finish as quickly as possible.
Note: this is not a good workaround from an architecture perspective if you aren't using it for a quick script on a specific machine.
In my case, I checked the maximum number of running threads that my machine could run before I got the error; it was 150.
I added this code before starting a new thread. It checks whether the limit of running threads has been reached; if so, the app waits until some of the running threads finish, then it starts the new thread:
while threading.active_count() > 150:
    time.sleep(5)
mythread.start()
If you are using a ThreadPoolExecutor, the problem may be that your max_workers is higher than the number of threads allowed by your OS.
It seems that the executor keeps the information about the last executed threads in the process table, even if the threads are already done. This means that when your application has been running for a long time, eventually it will register in the process table as many threads as ThreadPoolExecutor.max_workers.
As far as I can tell, it's not a Python problem. Your system somehow cannot create another thread (I had the same problem and couldn't start htop in another CLI via SSH).
The answer from Fernando Ulisses dos Santos is really good. I just want to add that there are other tools limiting the number of processes and memory usage "from the outside". This is pretty common for virtual servers. A starting point is your vendor's interface, or you might have luck finding some information in files like
/proc/user_beancounters

Python: How to trigger multiple process at same instant

I am trying to run a process that does an HTTP POST, which in turn will send an alert (the time taken to send an alert is in nanoseconds) to a server. I am trying to test the capacity of the server to handle alerts in milliseconds. As per the given standard, the server is said to handle 6000 alerts/second.
I created a piece of code using the multiprocessing module, which sends 6000 alerts, but I am using a for loop and hence the time taken to execute it exceeds a second. And hence all 6000 processes are not triggered at the SAME INSTANT.
Is there a way to trigger multiple (N) processes at the same instant?
This is my code: flowtesting.py, which is a library. It is followed by my test script after '####'.
import json
import httplib2

class flowTesting():
    def __init__(self, companyId, deviceIp):
        self.companyId = companyId
        self.deviceIp = deviceIp

    def generate_savedSearchName(self, randNum):
        self.randMsgId = randNum
        self.savedSearchName = "TEST %s risk31 more than 3" % self.randMsgId

    def def_request_body_dict(self):
        self.reqBody_dict = \
            { "Header" : {"agid" : "Agent1",
                          "mid": self.randMsgId,
                          "ts" : 1253125001
                         },
              "mp":
                  {
                    "host" : self.deviceIp,
                    "index" : self.companyId,
                    "savedSearchName" : self.savedSearchName,
                  }
            }
        self.req_body = json.dumps(self.reqBody_dict)

    def get_default_hdrs(self):
        self.hdrs = {'Content-type': 'application/json',
                     'Accept-Language': 'en-US,en;q=0.8'}

    def send_request(self, sIp, method="POST"):
        self.sIp = sIp
        self.url = "http://%s:8080/agent/splunk/messages" % self.sIp
        http_cli = httplib2.Http(timeout=180, disable_ssl_certificate_validation=True)
        rsp, rsp_body = http_cli.request(uri=self.url, method=method, headers=self.hdrs, body=self.req_body)
        print "rsp: %s and rsp_body: %s" % (rsp, rsp_body)
# My testScript
from flowTesting import flowTesting
import random
import multiprocessing

deviceIp = "10.31.421.35"
companyId = "CPY0000909"
noMsgToBeSent = 1000
sIp = "10.31.44.235"
uniq_msg_id_list = random.sample(xrange(1, 10000), noMsgToBeSent)

def runner(companyId, deviceIp, uniq_msg_id):
    proc = flowTesting(companyId, deviceIp)
    proc.generate_savedSearchName(uniq_msg_id)
    proc.def_request_body_dict()
    proc.get_default_hdrs()
    proc.send_request(sIp)

process_list = []
for uniq_msg_id in uniq_msg_id_list:
    savedSearchName = "TEST-1000 %s risk31 more than 3" % uniq_msg_id
    process = multiprocessing.Process(target=runner, args=(companyId, deviceIp, uniq_msg_id,))
    process.start()
    process.join()
    process_list.append(process)

print "Process list: %s" % process_list
print "Unique Message Id: %s" % uniq_msg_id_list
Making them all happen in the same instant is obviously impossible—unless you have a 6000-core machine and an OS kernel whose scheduler is able to handle them all perfectly (which you don't), you can't get 6000 pieces of code running at once.
And, even if you did, what they're all trying to do is to send a message on a socket. Even if your kernel was that insanely parallel, unless you have 6000 separate NICs, they're going to end up serialized in the NIC buffer. That's the way IP works: one packet after another. And of course there are all the routers on the path, the server's NIC, the server's OS, etc. And even if IP doesn't get in the way, bytes take time to transfer over a cable. So the only way to do this at the same instant, even in theory, would be to have 6000 NICs on each side and wire them up directly to each other with identical fiber.
However, you don't really need them in the same instant, just closer to each other than they are. You didn't show us your code, but presumably you're just starting 6000 Processes that all immediately try to send a message. That means you're including the process startup time—which can be pretty slow (especially on Windows)—in the skew time.
You can reduce that by using threads instead of processes. That may seem counterintuitive, but Python is pretty good at handling I/O-bound threads, and every modern OS is very good at starting new threads.
But really, what you need is a Barrier on your threads or processes, to let all of them complete all the setup work (including process startup) before any of them try to do any work.
It still probably won't be tight enough, but it will be a lot tighter than you probably have right now.
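For example, a minimal sketch of that barrier idea with threads (using Python 3's threading.Barrier; the worker count and function names are illustrative, not the poster's code):

import threading

NUM_WORKERS = 100  # illustrative; the real test would use more
start_barrier = threading.Barrier(NUM_WORKERS)

def send_alert(worker_id):
    # placeholder for the real HTTP POST
    print("worker %d firing" % worker_id)

def worker(worker_id):
    # ... build the request body, open the connection, etc. ...
    start_barrier.wait()  # block until every worker has finished its setup
    send_alert(worker_id)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()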
The next limit you're going to face is context-switching time. Modern OSs are pretty good at scheduling, but not 6000-simultaneous-tasks good. So really, you want to reduce this to N processes, each one just spamming 6000/N connections sequentially as fast as possible. That will get them into the kernel/NIC much faster than trying to do 6000 at once and making the OS do the serialization for you. (In fact, on some platforms, depending on your hardware, you might actually be better off with one process doing 6000 in a row than N doing 6000/N. Test it both ways.)
There's still some overhead for the socket library itself. To get around that, you want to pre-craft all of the IP packets, then create a single raw socket and spam those packets. Send the first packet from each connection, then the second packet from each connection, etc.
You need to use an inter-process synchronization primitive. On Linux you would use a Sys-V semaphore, on Windows you would use a Win32 event.
Your 6000 processes would wait on this semaphore/event, and from a different process you would trigger it, thus releasing all your 6000 processes from their waiting state to a ready state, and then the OS would start executing them as quickly as possible.
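A portable stand-in for that gate (a sketch only, using multiprocessing.Event instead of a raw Sys-V semaphore or Win32 event; the worker is a placeholder):

import multiprocessing as mp

def worker(gate, n):
    gate.wait()                       # every child blocks here
    print("process %d released" % n)  # the real code would send its alert here

if __name__ == "__main__":
    gate = mp.Event()
    procs = [mp.Process(target=worker, args=(gate, i)) for i in range(8)]
    for p in procs:
        p.start()
    gate.set()  # releases all waiting children at once
    for p in procs:
        p.join()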
