Embarrassingly parallelizing a process in a Jupyter notebook - Python

I have to compute a lot of integrals (roughly 8 million), which requires access to a set of coefficients taking up 15 GB of RAM. If I leave the computation to run on its own it takes 24 hours, so to speed it up I have so far just been splitting the integrals into chunks and running them in different notebooks. But this "wastes" RAM, since I load the same 15 GB of information N times, where N is the number of notebooks I launch.
I am sure there must be a way to compute this faster. I have tried Pool.map from the multiprocessing library in Python, but I have not been successful.
I have 16 CPUs available and plenty of RAM, but the computer I'm using is shared with two other users and I don't want to waste RAM that everyone needs.
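For what it's worth, here is a minimal sketch of a setup that avoids loading the coefficients N times, assuming Linux so the 'fork' start method is available and the table loaded in the parent is shared copy-on-write with the workers (compute_integral and the tiny COEFFS list are stand-ins for the real routine and the ~15 GB table):
import multiprocessing as mp

COEFFS = None  # filled in once by the parent; inherited by the workers via fork

def compute_integral(i):
    # Stand-in for the real integral; it only *reads* COEFFS, so the forked
    # workers keep sharing the parent's memory pages instead of copying them.
    return COEFFS[i % len(COEFFS)] * i

if __name__ == "__main__":
    COEFFS = [1.0, 2.0, 3.0]             # stand-in for the ~15 GB coefficient table
    ctx = mp.get_context("fork")          # copy-on-write sharing; Linux only
    with ctx.Pool(processes=8) as pool:   # leave some of the 16 CPUs for the other users
        results = pool.map(compute_integral, range(1_000_000), chunksize=10_000)
    print(len(results))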

Related

Efficiently select the number of multiprocessing workers

I am designing an analysis program that I would like to work seamlessly across three systems.
My home computer: 8 CPU cores and 16 GB of memory
A workstation: 80 CPU cores and 64 GB of memory
A super computing node: 28 CPU cores with 128 GB of memory
My program uses multiprocessing pool workers to run the slower steps of the analysis in parallel in order to save time. However, I've run into the issue that when I test my code I have trouble automatically setting an efficient number of sub-processes for the current system in Pool(processes=<process number>).
Initially I tried to fix this by checking the number of cores available (multiprocessing.cpu_count()) and dividing that by some fraction. Set to 1/2, an 8-core system would use 4 workers, an 80-core system would use 40, and so on.
However, because of the extreme differences in core-to-memory ratio across these systems, this often leads either to lockups, when too many workers are created for the available memory, or to wasted time, when the number of workers isn't maximized.
What I want to know: How can I dynamically adjust the number of workers I generate across any system so that it always creates the most efficient number of workers?
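One common answer (a sketch, not a guaranteed recipe) is to bound the worker count by memory as well as by cores; per_worker_gb is an assumption you have to measure for your own workload, and psutil is assumed to be installed:
import multiprocessing
import psutil  # third-party; used only to read the currently available memory

def pick_worker_count(per_worker_gb=4):
    # Use as many workers as there are cores, but never more than the
    # available memory can hold at the measured per-worker footprint.
    by_cpu = multiprocessing.cpu_count()
    by_mem = int(psutil.virtual_memory().available // (per_worker_gb * 2**30))
    return max(1, min(by_cpu, by_mem))

# pool = multiprocessing.Pool(processes=pick_worker_count())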

How to limit the number of CPU cores used in Python (simple solution needed)

Could you suggest an easy solution for limiting the number of CPU cores used in Python 3.5?
I check the core count by using:
import multiprocessing
print(multiprocessing.cpu_count())
When I run TensorFlow with 1200 first-layer neurons (absolutely identical dataset, Python version, and library versions), it shows identical results on servers with 24 and 32 cores, but different results on my local machine with 12 cores (details are in my SO question from yesterday).
I know for sure that the number of CPU cores is the key to the problem, since when I decrease the number of first-layer neurons, all three machines show identical results.
Probably, when processing a large amount of data, the calculation depends slightly on the CPU configuration.
Solved the problem by adding the following at the very beginning of my code:
tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(1)
Roy2012 - thank you very much.
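For completeness, a self-contained version of that fix (TensorFlow 2.x assumed) with a quick check that the settings took effect:
import tensorflow as tf

# Must run before TensorFlow creates its thread pools, i.e. before any ops.
tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(1)

print(tf.config.threading.get_inter_op_parallelism_threads())  # expected: 1
print(tf.config.threading.get_intra_op_parallelism_threads())  # expected: 1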

Python multithreaded execution - not scalable

I need to run a very CPU- and memory-intensive Python calculation (Monte Carlo-like). I benchmarked execution on my development machine; because of memory (up to 9 GB per worker) I can only run one core there.
I attempted to run the same calculation on a server (32 cores, 256 GB RAM) using multiprocessing.Pool. Surprisingly, increasing the number of workers increases the runtime per worker, quite dramatically: with 8 workers instead of 4, each one runs 3 times longer. Performance Monitor shows at most 9 x 8 GB of memory in use, far below the maximum available.
Win Server 2008 R2, 256 GB RAM, Intel® Xeon® Processor E5-2665 x2
I know that:
1. Time is spent in the function itself, in three CPU-expensive steps.
2. Of these, the first (random drawings and conversion to events) and the last (a C++ module for aggregation) are much less sensitive to the problem (runtime increases by up to a factor of 2). The second step, which contains Python matrix algebra including the scipy.linalg.blas.dgemm function, can be 6 times more expensive when I run more workers. It does not consume the most memory (step 1 does; after step 1 it is no more than 5 GB).
3. If I manually run the same pieces from different DOS boxes, I see identical behaviour.
I need the calculation time to scale in order to improve performance, but I cannot get it to. Am I missing something? Python memory limitations? Something specific to Windows Server 2008? A problem with BLAS oversubscription?
You are missing information about the GIL. In CPython, threading does not give you additional performance for CPU-bound code; it only allows computation to continue while some time-consuming I/O operation is waiting in another thread.
To get a performance speedup from threads, your function needs to release the GIL. That means it cannot be pure Python; it has to be written in Cython/C/C++ with the proper configuration.
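A minimal sketch of that point (not the asker's code): for a pure-Python, CPU-bound function, a thread pool brings no speedup on CPython because of the GIL, while a process pool scales with the cores:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import time

def cpu_task(n):
    # Pure-Python loop: holds the GIL for its whole duration.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    work = [2_000_000] * 8
    for pool_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time.perf_counter()
        with pool_cls(max_workers=4) as pool:
            list(pool.map(cpu_task, work))
        print(pool_cls.__name__, round(time.perf_counter() - start, 2), "s")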

Can I run two separate Jupyter notebook files at the same time, without slowdown, on a single-CPU computer?

I am currently running a Python function in a Jupyter notebook, which is taking quite some time. Python says it is running at about 98% of the CPU; however, about 60% of my CPU is still unused. After some googling I found that this has something to do with the threading of my processor (I am not a computer engineer, so sorry if this is incorrect). What I am wondering is: if I run another function in a Jupyter notebook, will it take up some of that 60% unused capacity, or will it divide the 98% between the two functions, slowing down both? I hope you can help. If anything is unclear, please let me know.
P.S. I am using a MacBook Pro Retina, late 2012 (I know), 2.5 GHz Intel Core i5, with 8 GB of RAM. It has two cores and one processor.
You have an Intel Sandybridge or Ivybridge CPU. It has two physical cores with hyperthreading, so it probably appears as 4 logical cores to the OS.
Each core has its own private L1i/d and L2 cache, but L3 (and memory bandwidth) is shared between cores.
Running a separate process or threads on the other physical CPU can slow down the first one by these mechanisms:
dual-core max turbo clock speed is lower than single-core turbo.
they compete for memory bandwidth and L3 cache footprint. (i.e. more L3 cache misses).
If L3 cache misses and memory bandwidth aren't significant bottlenecks for your workload, then using both cores for separate tasks is pretty much pure win.
Running 4 threads (so the OS will have to schedule tasks onto both logical cores of each physical core) will give some slowdown, but it depends a lot more on the details of the workload. See Agner Fog's microarch guide (http://agner.org/optimize/) for the asm / cpu-architecture details of how HT statically partitions or dynamically shares various execution resources. But really just try it and see.
Probably a single thread has some stalls for cache misses and other bottlenecks other than pure throughput, so you could gain some throughput at the expense of single-core performance with hyperthreading.
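If you want to see how your own machine is exposed to the OS, a short check (psutil assumed installed; the standard library alone only reports logical cores):
import os
import psutil  # third-party

print("logical cores: ", os.cpu_count())                   # e.g. 4 with hyperthreading
print("physical cores:", psutil.cpu_count(logical=False))  # e.g. 2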

Multiple Thread data generator

I have a small Python script used to generate lots of data into a file. It takes about 6 minutes to generate 6 GB of data; however, my target data size could be up to 1 TB. By linear extrapolation, it would take about 1000 minutes to generate 1 TB of data, which is unacceptable for me.
So I am wondering: will multithreading help me shorten that time, and why? If not, what other options do I have?
Thanks!
Currently, typical hard drives can write on the order of 100 MB per second.
Your program is writing 6 GB in 6 minutes, which means the overall throughput is ~ 17 MB/s.
So your program is not pushing data to disk anywhere near the maximum rate (assuming you have a typical hard drive).
So your problem might actually be CPU-bound.
If this "back-of-the-envelope" calculation is correct, and if you have a machine with multiple processors, using multiple processes could help you generate more data quicker, which could then be sent to a single process which writes the data to disk.
Note that if you are using CPython, the most common implementation of Python, then the GIL (global interpreter lock) prevents multiple threads from running at the same time. So to do concurrent calculations, you need to use multiple processes rather than multiple threads. The multiprocessing or concurrent.futures module can help you here.
Note that if your hard drive can write 100 MB/s, it would still take ~160 minutes to write 1 TB to disk, and once your processes together generate data faster than 100 MB/s, adding more processes would not lead to any further speed gain.
Of course, your hardware may be much faster or much slower than this, so it pays to know your hardware specs.
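A minimal sketch of the pattern mentioned above, with several generator processes feeding a single writer (generate_chunk and the chunk sizes are placeholders for the real generation code):
import multiprocessing as mp

CHUNK_MB = 16   # assumed chunk size
N_CHUNKS = 8    # assumed number of chunks for this sketch (128 MB total)

def generate_chunk(_):
    # Placeholder for the real, CPU-bound data generation.
    return b"A" * (CHUNK_MB * 2**20)

if __name__ == "__main__":
    # The pool generates chunks in parallel; the parent is the single writer,
    # so the file is still written sequentially and in order.
    with mp.Pool() as pool, open("/tmp/generated.bin", "wb") as f:
        for chunk in pool.imap(generate_chunk, range(N_CHUNKS)):
            f.write(chunk)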
You can estimate how fast you can write to disk using Python by doing a simple experiment:
with open('/tmp/test', 'wb') as f:
    x = b'A' * 10**8   # 100 MB of data; bytes, since the file is opened in binary mode
    f.write(x)
% time python script.py
real 0m0.511s
user 0m0.020s
sys 0m0.020s
% ls -l /tmp/test
-rw-rw-r-- 1 unutbu unutbu 100000000 2014-09-12 17:13 /tmp/test
This shows 100 MB were written in 0.511s. So the effective throughput was ~195 MB/s.
Note that if you instead call f.write in a loop:
with open('/tmp/test', 'wb') as f:
    for i in range(10**7):
        f.write(b'A')
then the effective throughput drops dramatically to just ~ 3MB/s. So how you structure your program -- even if using just a single process -- can make a big difference. This is an example of how collecting your data into fewer but bigger writes can improve performance.
As Max Noel and kipodi have already pointed out, you can also try writing to /dev/null:
with open(os.devnull, 'wb') as f:
and timing a shortened version of your current script. This will show you how much time is being consumed (mainly) by CPU computation. It's this portion of the overall run time that may be improved by using concurrent processes. If it is large then there is hope that multiprocessing may improve performance.
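For instance, a rough way to time the generation-only part (generate_record here is a stand-in for whatever your script computes per write):
import os
import time

def generate_record(i):
    # Stand-in for the real per-record generation work.
    return ("%d," % i) * 100 + "\n"

start = time.perf_counter()
with open(os.devnull, "w") as f:
    for i in range(10**6):
        f.write(generate_record(i))
print("generation-only time: %.2f s" % (time.perf_counter() - start))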
In all likelihood, multithreading won't help you.
Your data generation speed is either:
IO-bound (that is, limited by the speed of your hard drive), and the only way to speed it up is to get a faster storage device. The only type of parallelization that can help you is finding a way to spread your writes across multiple devices (can you use multiple hard drives?).
CPU-bound, in which case Python's GIL means you can't take advantage of multiple CPU cores within one process. The way to speed your program up is to make it so you can run multiple instances of it (multiple processes), each generating part of your data set.
Regardless, the first thing you need to do is profile your program. What parts are slow? Why are they slow? Is your process IO-bound or CPU-bound? Why?
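A quick way to get that profile with the standard library (generate_data is a placeholder for the real entry point); alternatively, run the whole script under python -m cProfile -s cumulative your_script.py:
import cProfile
import pstats

def generate_data():
    # Placeholder for the real generation code.
    return sum(i * i for i in range(10**6))

cProfile.run("generate_data()", "gen.prof")
pstats.Stats("gen.prof").sort_stats("cumulative").print_stats(10)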
6 minutes to generate 6 GB means you take one minute to generate 1 GB. A typical hard drive is capable of up to 80-100 MB/s of throughput when new. That puts the I/O limit at approximately 6 GB per minute.
So it looks like the limiting factor is the CPU, which is good news (running more instances can help you).
However, I wouldn't use multithreading in Python because of the GIL. A better idea is to run several scripts writing to different offsets in separate processes, or to use Python's multiprocessing module.
I would check this, though, by running it while writing to /dev/null, to make sure you truly are CPU-bound.
