python script executes quicker in docker container than natively - python

I am executing a simple Python (v2.7.17) script which finds the square roots of the numbers between 1 and 1000000. It does this 1000000 times for a single execution, and this is then repeated 100 times. The output is the time taken to execute each cycle.
When I execute this script in a Linux shell, each execution time is printed one after the other. They vary, but the average across the total 100 executions is 0.126154s.
When I run the exact same script within a Docker container, there is no output until the end of all 100 executions, at which point the output for all 100 is displayed at once. The execution times are quicker than native: the average across 100 Docker executions is 0.095896s.
When I apply various stresses to the system while executing the script both natively and in Docker, the average execution times differ greatly. When I stress the CPU, I get the following averages across 100 executions:
native average: 0.506660s
docker average: 0.190208s
I am curious as to why my Python script runs quicker in a container. Any thoughts would be greatly appreciated. The Python code is:
import timeit

mycode = """
def example():
    mylist = []
    for x in range(1000000):
        mylist.append(sqrt(x))
"""

mysetup = "from math import sqrt"

print timeit.timeit(setup = mysetup, stmt = mycode, number = 1000000)

I did more digging and found out why I had better execution times while running the script in a container.
When I start the script in a container on my system (4 cores), it looks like a whole core, or a percentage of all four, is dedicated or reserved to running that container; the rest of the system's running processes then divide up whatever CPU availability is left.
When running the script natively, it has to compete with everything else running on the system. So when I applied the stress tests (stress-ng) to the CPU, each stress test was a new process, and the available processor time was divided into equal amounts for each stress process. The more stresses I applied to the system, the slower the script executed, but when executing in a container this did not apply, because a large chunk of processor time remained available to the container at all times.
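One rough way to sanity-check this explanation is to compare the set of CPUs the process is actually allowed to run on natively versus inside the container. This is only a minimal sketch, assuming Linux and Python 3.3+ for os.sched_getaffinity, so it is an illustrative aside rather than a drop-in addition to the Python 2.7 benchmark:

import os

# CPUs this process is currently allowed to run on (Linux, Python 3.3+)
print("CPUs available to this process:", sorted(os.sched_getaffinity(0)))
# Total logical CPUs on the machine
print("Logical CPUs on the machine:", os.cpu_count())

If the container run reports a restricted or differently scheduled CPU set compared to the native run, that would support the scheduling explanation above.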

Related

A single Python script involving np.linalg.eig is inexplicably taking multiple CPUs?

Note: The problem seems to be related to np.linalg.eig, eigsh, and scipy.sparse.linalg.eigsh. For scripts not involving these functions, everything on the AWS box works as expected.
The most basic script I have found with the problem is:
import numpy as np

for i in range(0, num_iter):
    x = np.linalg.eig(np.random.rand(1000, 1000))
I'm having a very bizarre error on AWS where a basic Python script that calculates eigenvalues is using 100% of all the cores (and is going no faster because of it).
Objective: Run computationally intensive Python code. The code is a parallel for loop, where each iteration is independent. I have two versions of this code, a basic version without multiprocessing and one using the multiprocessing module.
Problem: The virtual machine is a c6i-series instance on AWS.
On my personal machine, using 6 cores is roughly 6 times faster when using the parallelized code. Using more than 1 core with the same code on the AWS box makes the runtime slower.
Inexplicable Part:
I tried to get around this by launching multiple copies of the basic script using &, and this doesn't work either. Running n copies causes them all to run at roughly 1/n of the speed. Inexplicably, a single instance of the Python script uses all the cores of the machine: the Unix command top shows all of the CPUs in use, and AWS CPU usage monitoring confirms 100% usage of the machine. I don't see how this is possible given the GIL.
Partial solution? Specifying the processor fixed the issue somewhat:
Running the commands taskset --cpu-list i my_python_script.py & for i from 1 to n, they do indeed run in parallel, and the time is independent of n (for small n). The CPU usage statistics on the AWS monitor are what you would expect. The speed when using one processor was the same as when the script ran while taking all the cores of the machine.
Note: The fact that the runtime on 1 processor is the same suggests it was really running on 1 core all along, and the others were somehow being used erroneously.
Question:
Why is my basic Python script taking all the cores of the AWS machine while not going any faster? How is this error even possible? And how can I get it to run simply with multiprocessing, without using this weird taskset --cpu-list workaround?
I had the exact same problem on the Google Cloud Platform as well.
The basic script is very simple:
from my_module import my_np_and_scipy_function
from my_other_module import input_function

if __name__ == "__main__":
    output = []
    for i in range(0, num_iter):
        result = my_np_and_scipy_function(kwds, param=input_function)
        output.extend(result)
With multiprocessing, it is:
import multiprocessing

from my_module import my_np_and_scipy_function

if __name__ == "__main__":
    pool = multiprocessing.Pool(cpu_count)
    results = []
    for i in range(0, num_iter):
        result = pool.apply_async(
            my_np_and_scipy_function, kwds={"param": input_function, ...},
        )
        results.append(result)
    output = []
    for x in results:
        output.extend(x.get())
Numpy uses multiprocessing in some functions, so it is possible.
You can see here: https://github.com/numpy/numpy/search?q=multiprocessing
Following the answers in the post, Limit number of threads in numpy, the numpy eig functions and the scripts work properly by putting the following lines of code at the top of the script:
import os
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

ansible runner runs too long

When I use Ansible's Python API to run a script on remote machines (thousands of them), the code is:
runner = ansible.runner.Runner(
    module_name='script',
    module_args='/tmp/get_config.py',
    pattern='*',
    forks=30
)
then, I use
datastructure = runner.run()
This takes too long. I want to insert the stdout from the data structure into MySQL. What I want is: once a machine has returned data, insert that data into MySQL, then move on to the next, until all the machines have returned.
Is this a good idea, or is there a better way?
The runner call will not complete until all machines have returned data, can't be contacted, or the SSH session times out. Given that this is targeting thousands of machines and you're only doing 30 machines in parallel (forks=30), it's going to take roughly Time_to_run_script * Num_Machines / 30 to complete. Does this align with your expectation?
You could raise the number of forks to a much higher value to have the runner complete sooner. I've pushed this into the hundreds without much issue.
If you want maximum visibility into what's going on and aren't sure if there is one machine holding you up, you could run through each host serially in your Python code.
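A hedged sketch of that per-host approach, using the Ansible 1.x Runner API from the question (the host list and the MySQL helper are hypothetical placeholders, and the result handling assumes the usual 'contacted' section returned by runner.run()):

import ansible.runner

def insert_into_mysql(host, stdout):
    # hypothetical placeholder for your MySQL insert logic
    pass

hosts = ['host01', 'host02']  # hypothetical host names from your inventory
for host in hosts:
    runner = ansible.runner.Runner(
        module_name='script',
        module_args='/tmp/get_config.py',
        pattern=host,
        forks=1
    )
    result = runner.run()
    contacted = result.get('contacted', {})
    if host in contacted:
        # the 'script' module result typically includes stdout
        insert_into_mysql(host, contacted[host].get('stdout', ''))

This trades throughput for visibility: you see (and can store) each host's result as soon as it returns, at the cost of contacting one host at a time.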
FYI: this module and class are completely gone in Ansible 2.0, so you might want to make the jump now to avoid having to rewrite code later.

Python Couchbase pipeline() kills interpreter with large batches

So I wanted to use the Python driver to batch-insert documents into Couchbase. However, if the batch exceeds a few thousand documents, the Python kernel gets restarted (I'm using a notebook).
To reproduce this, please try:
from couchbase.bucket import Bucket

cb = Bucket('couchbase://localhost/default')

keys_per_doc = 50
doc_count = 10000

docs = [dict(
    [
        ('very_long_feature_{}'.format(i), float(i) if i % 2 == 0 else i)
        for i in xrange(keys_per_doc)
    ] + [('id', id_)]
) for id_ in xrange(doc_count)]

def sframe_to_cb(sf, key_column, key_prefix):
    pipe = cb.pipeline()
    with pipe:
        for r in sf:
            cb.upsert(key_prefix + str(r[key_column]), r)
    return 0

p = sframe_to_cb(docs, 'id', 'test_')
The fun thing is that all the docs get inserted, and I suppose the interpreter dies when gathering the results in the pipeline's __exit__ method.
I don't get any error message, and the notebook console just says that it has restarted.
I'm curious what is causing this behavior and if there is a way to fix it.
Obviously I can do mini-batches (up to 3000 docs in my case) but this makes it much slower if they are processed sequentially.
I cannot use multiprocessing because I run the inserts inside of celery.
I cannot use multiple celery tasks because the serialisation of batches is too expensive and could kill our redis instance.
So the questions:
What is causing the crash with large batches and is there a way to fix it?
Assuming that nothing can go wrong with the upserts, can I make the pipeline discard results?
Is there a different way to achieve high throughput from a single process?
Additional info as requested in comments:
VMware Fusion on a Mac running an Ubuntu 14.04 LTS VM
The guest Ubuntu has 4 GB RAM, 12 GB swap on SSD, 2 cores (4 threads)
The impression that doing mini-batches is slower comes from watching the bucket statistics (a large batch peaks at 10K TPS, smaller ones get about 2K TPS)
There is a large speed-up if I use multiprocessing and the batches are distributed across multiple CPUs (20-30K TPS); however, I cannot do this in production because of Celery limitations (I cannot use a ProcessPoolExecutor inside a Celery task)
I cannot really tell when exactly the crash happens (I'm not sure if this is relevant)
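For reference, the mini-batch workaround mentioned above could look roughly like the following sketch, reusing the Bucket/pipeline usage from the snippet in the question (the batch size is illustrative):

def upsert_in_batches(cb, docs, key_column, key_prefix, batch_size=3000):
    # One pipeline per batch keeps the result set gathered at pipeline
    # exit small, at the cost of running the batches sequentially.
    for start in xrange(0, len(docs), batch_size):
        with cb.pipeline():
            for r in docs[start:start + batch_size]:
                cb.upsert(key_prefix + str(r[key_column]), r)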

issue in trying to execute certain number of python scripts at certain intervals

I am trying to execute a certain number of Python scripts at certain intervals. Each script takes a lot of time to execute, and hence I do not want to waste time waiting to run them sequentially. I tried this code, but it is not executing them simultaneously; it executes them one by one:
Main_file.py
import time

def func(argument):
    print 'Starting the execution for argument:', argument
    execfile('test_' + argument + '.py')

if __name__ == '__main__':
    arg = ['01', '02', '03', '04', '05']
    for val in arg:
        func(val)
        time.sleep(60)
What I want is to kick off by starting the execution of the first file (test_01.py). This will keep executing for some time. After 1 minute has passed, I want to start the simultaneous execution of the second file (test_02.py). This will also keep executing for some time. In this way I want to start the execution of all the scripts at gaps of 1 minute.
With the above code, I notice that the execution happens one file after the other and not simultaneously, as the print statements in these files appear one after the other and are not interleaved.
How can I achieve the needed functionality?
Using Python 2.7 on my computer, the following seems to work with small Python scripts such as test_01.py, test_02.py, etc., when threading with the following code:
import time
import thread

def func(argument):
    print('Starting the execution for argument:', argument)
    execfile('test_' + argument + '.py')

if __name__ == '__main__':
    arg = ['01', '02', '03']
    for val in arg:
        thread.start_new_thread(func, (val,))
        time.sleep(10)
However, you indicated that you kept getting a memory exception error. This is likely due to your scripts using more stack memory than was allocated to them, as each thread is given a fixed stack size by default. You could attempt to give them more memory by calling
thread.stack_size([size])
which is outlined here: https://docs.python.org/2/library/thread.html
Without knowing the number of threads that you're attempting to create or how memory-intensive they are, it's difficult to say whether a better solution should be sought. Since you seem to be looking into executing multiple scripts essentially independently of one another (no shared data), you could also look into the multiprocessing module here:
https://docs.python.org/2/library/multiprocessing.html
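A hedged sketch of that multiprocessing alternative (Python 2 syntax to match the question; the script names follow the test_XX.py convention used above):

import time
import multiprocessing

def func(argument):
    print 'Starting the execution for argument:', argument
    execfile('test_' + argument + '.py')

if __name__ == '__main__':
    procs = []
    for val in ['01', '02', '03', '04', '05']:
        # start each script in its own process, one minute apart
        p = multiprocessing.Process(target=func, args=(val,))
        p.start()
        procs.append(p)
        time.sleep(60)
    for p in procs:
        p.join()

Because each script runs in its own process, the scripts are not constrained by the GIL or by the per-thread stack size discussed above.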
If you need them to run in parallel, you will need to look into threading. Take a look at https://docs.python.org/3/library/threading.html or https://docs.python.org/2/library/threading.html, depending on the version of Python you are using.

Test speed of two scripts

I'd like to test the speed of a bash script and a Python script. How would I get the time it took to run them?
If you're on Linux (or another UN*X), try time:
The time command runs the specified program command with the given arguments. When command finishes, time writes a message to standard error giving timing statistics about this program run. These statistics consist of (i) the elapsed real time between invocation and termination, (ii) the user CPU time (the sum of the tms_utime and tms_cutime values in a struct tms as returned by times(2)), and (iii) the system CPU time (the sum of the tms_stime and tms_cstime values in a struct tms as returned by times(2)).
Note that you need to eliminate outer effects - e.g. other processes using the same resources can skew the measurement.
I guess that you can use
time ./script.sh
time python script.py
At the beginning of each script, output the start time, and at the end of each script, output the end time. Subtract the times and compare. Or use the time command, if it is available, as others have answered.
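A minimal sketch of that in-script approach (Python 2 style to match the rest of this page; the work being timed is a placeholder):

import time

start = time.time()
# ... the work being measured goes here ...
end = time.time()
print 'elapsed: %.6f seconds' % (end - start)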
