Spyder hangs on calling random.uniform()

My program frequently accesses random numbers. I initialize my random number generator via:
import random
random.seed(1)
I'm calling random.uniform() a lot in code for an evolutionary model (biology), and after a while it repeatedly hangs (doing nothing for 20 minutes, at which point I stop it). While it hangs, Python is using 20%-30% of my CPU (I have four cores) and 10GB of RAM (I have a lot of data).
Can I do something to make the default random library not hang, or is there another random library I can use?
I'm running Spyder 4.2.5 with Python 3.8 on Windows 10. (The problem already existed with an earlier version of Spyder, and I installed Spyder 4.2.5 from scratch.)

Just speculating, but the default random module definitely shouldn't do that, so I suspect either:
there's something wrong with your Python install itself, or
your system doesn't have enough entropy (or you are exhausting the pool by making a massive number of sequential calls to it, rather than simply reusing the seeded generator) and is relying on /dev/random (which blocks until data is available). See related:
differences between random and urandom
https://www.python.org/dev/peps/pep-0524/
Try calling os.getrandom with and without the non-blocking flag (note that os.getrandom is only available on Linux):
import os
os.getrandom(1024, flags=os.GRND_NONBLOCK)  # raises BlockingIOError instead of blocking when entropy is low
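As a quick cross-check: after seeding, CPython's random module runs a userspace Mersenne Twister and should never touch the OS entropy pool, so a timing test like the following can rule the RNG in or out as the bottleneck:

import random
import timeit

random.seed(1)
# One million draws should finish well under a second on typical hardware;
# if this is fast, the hang is elsewhere in the program.
print(timeit.timeit(lambda: random.uniform(0.0, 1.0), number=1_000_000))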

Related

How to efficiently run multiple PyTorch processes / models at once? Traceback: The paging file is too small for this operation to complete

Background
I have a very small network which I want to test with different random seeds.
The network barely uses 1% of my GPU's compute power, so in theory I could run 50 processes at once to try many different seeds.
Problem
Unfortunately I can't even import PyTorch in multiple processes. When the number of processes exceeds 4, I get a traceback about a too-small paging file.
Minimal reproducible code§ - dispatcher.py
from subprocess import Popen
import sys

procs = []
for seed in range(50):
    procs.append(Popen([sys.executable, "ml_model.py", str(seed)]))

for proc in procs:
    proc.wait()
§I increased the number of seeds so people with better machines can also reproduce this.
Minimal reproducible code - ml_model.py
import torch
import time
time.sleep(10)
Traceback (most recent call last):
  File "ml_model.py", line 1, in <module>
    import torch
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\__init__.py", line 117, in <module>
    raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies.
Further Investigation
I noticed that each process loads a lot of DLLs into RAM, and when I close all other programs which use a lot of RAM I can get up to 10 processes instead of 4. So it seems like a resource constraint.
Questions
Is there a workaround?
What's the recommended way to train many small networks with PyTorch on a single GPU?
Should I write my own CUDA kernel instead, or use a different framework to achieve this?
My goal would be to run around 50 processes at once (on a machine with 16GB of RAM and 8GB of GPU RAM).
I've looked a bit into this tonight. I don't have a solution (edit: I have a mitigation, see the edit at end), but I have a bit more information.
It seems the issue is caused by NVIDIA fatbins (.nv_fatb) being loaded into memory. Several DLLs, such as cusolver64_xx.dll, torch_cuda_cu.dll, and a few others, have .nv_fatb sections in them. These contain many different variations of CUDA code for different GPUs, so they end up being several hundred megabytes to a couple of gigabytes.
When Python imports torch it loads these DLLs and maps the .nv_fatb section into memory. For some reason, instead of just being a memory-mapped file, it is actually taking up memory. The section is marked 'copy on write', so it's possible something writes into it? I don't know. But anyway, if you look at Python using VMMap ( https://learn.microsoft.com/en-us/sysinternals/downloads/vmmap ) you can see that these DLLs commit huge amounts of memory for this .nv_fatb section. The frustrating part is that the memory doesn't seem to be used. For example, right now my Python.exe has 2.7GB committed, but the working set is only 148MB.
Every Python process that loads these DLLs will commit several GB of memory loading these DLLs. So if 1 Python process is wasting 2GB of memory, and you try running 8 workers, you need 16GB of memory to spare just to load the DLLs. It really doesn't seem like this memory is used, just committed.
I don't know enough about these fatbinaries to try to fix it, but from looking at this for the past two hours it really seems like they are the issue. Perhaps it's an NVIDIA problem that these commit memory?
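As a rough way to observe this commit-versus-use gap from Python, psutil can report both numbers; on Windows it exposes the pagefile-backed commit charge as vms and the working set as rss (VMMap gives the per-section breakdown, while this only shows process totals). Something like:

import psutil

info = psutil.Process().memory_info()
# A large gap between committed and working-set memory hints at pages
# that are committed but never actually touched.
print("committed:   %.0f MiB" % (info.vms / 2**20))
print("working set: %.0f MiB" % (info.rss / 2**20))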
edit: I made this python script: https://gist.github.com/cobryan05/7d1fe28dd370e110a372c4d268dcb2e5
Get it and install its pefile dependency ( python -m pip install pefile ).
Run it on your torch and CUDA DLLs. In the OP's case, the command line might look like:
python fixNvPe.py --input=C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\lib\*.dll
(You also want to run this wherever your cusolver64_*.dll and friends are. This may be in your torch\lib folder, or it may be, e.g., C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\vXX.X\bin. If it is under Program Files, you will need to run the script with administrative privileges.)
What this script is going to do is scan through all DLLs specified by the input glob, and if it finds an .nv_fatb section it will back up the DLL, disable ASLR, and mark the .nv_fatb section read-only.
ASLR is 'address space layout randomization.' It is a security feature that randomizes where a DLL is loaded in memory. We disable it for this DLL so that all Python processes will load the DLL into the same base virtual address. If all Python processes using the DLL load it at the same base address, they can all share the DLL. Otherwise each process needs its own copy.
Marking the section 'read-only' lets Windows know that the contents will not change in memory. If you map a file into memory read/write, Windows has to commit enough memory, backed by the pagefile, just in case you make a modification to it. If the section is read-only, there is no need to back it in the pagefile. We know there are no modifications to it, so it can always be found in the DLL.
The theory behind the script is that changing these two flags will cause less memory to be committed for the .nv_fatb section, and more memory to be shared between the Python processes. In practice, it works. Not quite as well as I'd hoped (it still commits a lot more than it uses), so my understanding may be flawed, but it significantly decreases memory commit.
In my limited testing I haven't run into any issues, but I can't guarantee there are no code paths that attempt to write to the section we marked 'read-only'. If you start running into issues, though, you can just restore the backups.
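The core idea can be sketched with pefile along these lines (a simplified sketch, assuming this matches the gist's logic; the real script also creates backups and handles more edge cases, so prefer the gist for actual use):

import glob
import pefile

IMAGE_SCN_MEM_WRITE = 0x80000000                 # PE section flag: writable
IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE = 0x0040   # PE DLL flag: ASLR

# Illustrative path; point this at your torch\lib folder.
for path in glob.glob(r"C:\path\to\torch\lib\*.dll"):
    pe = pefile.PE(path)
    fatb = [s for s in pe.sections if s.Name.rstrip(b"\x00") == b".nv_fatb"]
    if not fatb:
        continue
    # Disable ASLR so every process maps the DLL at the same base address.
    pe.OPTIONAL_HEADER.DllCharacteristics &= ~IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE
    # Mark the fatbin section read-only so Windows need not back it with the pagefile.
    for section in fatb:
        section.Characteristics &= ~IMAGE_SCN_MEM_WRITE
    pe.write(path + ".patched")  # the real script replaces the DLL and keeps a backup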
edit 2022-01-20:
Per NVIDIA: "We have gone ahead and marked the nv_fatb section as read-only; this change will be targeting the next major CUDA release, 11.7. We are not changing the ASLR, as that is considered a safety feature."
This should certainly help. If it's not enough without the ASLR change as well, the script should still work.
In my case the system was already set to 'system managed size', yet I had the same error. That is because I pass a big variable to multiple processes within a function. I would likely need to set a very large paging file, as Windows cannot create it on the fly; instead I opted to reduce the number of processes, as this function is not always in use.
If you are on Windows, it may be better to use one (or more) fewer cores than the total number of physical cores, since the multiprocessing module in Python on Windows tends to grab everything it can and will actually try to use all logical cores.
>>> import multiprocessing
>>> multiprocessing.cpu_count()
12
# I actually have 6 physical cores; if you use this as the base it will likely hog the system
>>> import psutil
>>> psutil.cpu_count(logical=False)
6    # actual number of physical cores
>>> psutil.cpu_count(logical=True)
12   # logical cores (e.g. hyperthreading)
Please refer to here for more detail:
Multiprocessing: use only the physical cores?
Well, I managed to resolve this:
Open 'Advanced system settings', go to the Advanced tab, then click Settings under Performance.
Again click the Advanced tab --> Change --> unselect 'Automatically......'. For all the drives, set 'system managed size'. Restart your PC.
Following up on @chris-obryan's answer (I would comment but have no reputation), I've found that memory utilisation drops pretty sharply some time into training with their fix applied (on the order of the mentioned 2GB per process).
To eke out some more performance, it may be worth monitoring memory utilisation and spawning a new instance of the model when these drops in memory occur, leaving enough space (~3 or 4 GB to be safe) for a bit of overhead.
I was seeing ~28GB of RAM utilised during the setup phase, which dropped to about 14GB after iterating for a while.
(Note that my use case is a little different here, as I'm bottlenecked by host<->device transfers due to optimising with a GA, since a reasonable amount of CPU-bound processing needs to occur after each generation, so this could play into it. I am also using concurrent.futures.ProcessPoolExecutor() rather than manually using subprocesses.)
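A minimal sketch of such a seed sweep with ProcessPoolExecutor, where run_seed and the worker count are illustrative placeholders rather than the answerer's actual code:

import concurrent.futures

def run_seed(seed):
    # Import torch inside the worker so each process pays the DLL
    # commit cost only when it actually starts working.
    import torch
    torch.manual_seed(seed)
    ...  # build and train the small model here
    return seed

if __name__ == "__main__":
    # Cap workers so the committed DLL memory fits in RAM.
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as pool:
        for seed in pool.map(run_seed, range(50)):
            print("finished seed", seed)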
I changed num_workers=10 to num_workers=1. That helped me solve the problem.
To fix this problem, I updated CUDA to version 11.8.0 and PyTorch to the 11.6 cudatoolkit version with PyTorch 1.9.1, using conda:
conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge
Thanks to @chris-obryan I understood the problem and figured an update was already available. I measured the memory consumption before and after the update; it dropped sharply.
Since it seems that each import torch loads a bunch of fat DLLs (thanks @chris-obryan), I tried changing this:
import torch

if __name__ == "__main__":
    # multiprocessing stuff, paging file errors
to this...
if __name__ == "__main__":
    import torch
    # multiprocessing stuff
And it worked well (because when the subprocesses are created, __name__ is not "__main__").
Not an elegant solution, but perhaps useful to someone.

Is it possible to initialise a module before running a python program?

I wrote a Python program which uses a module (pytesseract, specifically), and I notice it takes a few seconds to import the module once I run it. I am wondering if there is a way to initialise the module before running the main program, in order to cut the duration of the actual program by a few seconds. Any suggestions?
One possible solution for slow startup time would be to split your program into two parts: one part that is always running as a daemon or service, and another that communicates with it to process individual tasks.
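A minimal sketch of that split using multiprocessing.connection, where the address, auth key, and OCR call are all illustrative placeholders:

# daemon.py: pays the slow import once, then serves OCR requests forever.
from multiprocessing.connection import Listener
from PIL import Image
import pytesseract  # slow import happens once, at daemon startup

with Listener(("localhost", 6000), authkey=b"change-me") as listener:
    while True:
        with listener.accept() as conn:
            image_path = conn.recv()        # client sends a file path
            text = pytesseract.image_to_string(Image.open(image_path))
            conn.send(text)                 # reply with the OCR text

# client.py: starts instantly because it never imports pytesseract.
# from multiprocessing.connection import Client
# with Client(("localhost", 6000), authkey=b"change-me") as conn:
#     conn.send("page.png")
#     print(conn.recv())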
As a quick answer without more info, pytesseract also imports (if they are installed) PIL, numpy, and pandas. If you don't need these, you could uninstall them to reduce load time.
I presume that you need to start your application multiple times with different arguments and you don't want to waste time on imports every time, right?
You can wrap the actual code in a while True: loop and use input() to get new arguments, or read arguments from a file. For example:
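A rough sketch of that loop, with process_file standing in (hypothetically) for the program's real logic:

import pytesseract  # slow import is paid only once
from PIL import Image

def process_file(path):
    # Hypothetical task: OCR one file and print the result.
    print(pytesseract.image_to_string(Image.open(path)))

while True:
    line = input("file> ").strip()
    if not line:        # empty input ends the loop
        break
    process_file(line)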

Loading time of shared library too large

I'm loading a shared library (*.so) with ctypes. However, the load time is very long; it is very slow.
What technique can I use to improve performance?
My module will always run from the prompt, running one command at a time.
$ ./myrunlib.py fileQuestion fileAnswer
# again
$ ./myrunlib.py fileQuestion fileAnswer
code:
from ctypes import *
drv = cdll.LoadLibrary('/usr/lib/libXPTO.so')
Either you've got a strange bug which makes your library load extremely slowly when used by a Python program (which I find rather unlikely), or the loading takes the time it takes (maybe because the library does a large initialization task upon being loaded).
In the latter case your only option seems to be to prevent any restarts of your Python program. Let it run in a loop which reads all tasks from stdin (or any other pipe, or a socket, or maybe even from job files) instead of from the command line.
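A minimal sketch of such a loop; the drv.process call is a hypothetical stand-in for whatever the library actually exposes:

import sys
from ctypes import cdll

# Pay the expensive library load exactly once per program lifetime.
drv = cdll.LoadLibrary('/usr/lib/libXPTO.so')

# Each line on stdin is one task: "<questionFile> <answerFile>".
for line in sys.stdin:
    question_file, answer_file = line.split()
    # drv.process(question_file.encode(), answer_file.encode())  # hypothetical call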

Stop a program before it uses too much memory

I'm working on a Python program which sometimes fills up a list with millions of items. The computer (Ubuntu) starts swapping and the debugger (Eclipse) becomes unresponsive.
Is it possible to add a line in the cycle that checks how much memory is being used, and interrupts the execution, so I can check what's going on?
I'm thinking about something like:
if usedmemory() > 1000000000:
    pass  # with a breakpoint here
but I don't know what usedmemory() could be.
This is highly dependent on the machine you're running Python on. Here's an SO answer with a way to do it on Linux (https://stackoverflow.com/a/278271/541208), but another answer there offers a more platform-independent solution (https://stackoverflow.com/a/2468983/541208): the psutil library, which you can install via pip install psutil:
>>> psutil.virtual_memory()
vmem(total=8374149120L, available=2081050624L, percent=75.1, used=8074080256L, free=300068864L, active=3294920704, inactive=1361616896, buffers=529895424L, cached=1251086336)
>>> psutil.swap_memory()
swap(total=2097147904L, used=296128512L, free=1801019392L, percent=14.1, sin=304193536, sout=677842944)
So you'd look at the percentage of available memory and kill your process depending on how much memory it has been using.
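Putting that together, a sketch of what usedmemory() could be, assuming the question means this process's own resident memory rather than system-wide usage:

import os
import psutil

def usedmemory():
    # Resident set size of the current process, in bytes.
    return psutil.Process(os.getpid()).memory_info().rss

if usedmemory() > 1000000000:
    pass  # set a breakpoint here to inspect the list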

Running Bulk Synchronous Parallel Model (BSP) in Python

The BSP parallel programming model has several benefits: the programmer need not explicitly care about synchronization, deadlocks become impossible, and reasoning about speed becomes much easier than with traditional methods. There is a Python interface to BSPlib in SciPy:
import Scientific.BSP
I wrote a little program to test BSP. The program is a simple random experiment which "calculates" the probability that throwing n dice yields a sum of k:
from Scientific.BSP import ParSequence, ParFunction, ParRootFunction
from sys import argv
from random import randint

n = int(argv[1]); m = int(argv[2]); k = int(argv[3])

def sumWuerfe(ws): return len([w for w in ws if sum(w) == k])
glb_sumWuerfe = ParFunction(sumWuerfe)

def ausgabe(result): print float(result)/len(wuerfe)
glb_ausgabe = ParRootFunction(ausgabe)

wuerfe = [[randint(1, 6) for _ in range(n)] for _ in range(m)]
glb_wuerfe = ParSequence(wuerfe)

# The parallel calculation:
ergs = glb_sumWuerfe(glb_wuerfe)
# Collecting the results on processor 0:
ergsGesamt = ergs.reduce(lambda x, y: x + y, 0)
glb_ausgabe(ergsGesamt)
The program works fine, but it uses just one process!
My question: does anyone know how to tell this Python BSP script to use 4 (or 8 or 16) processes? I thought this BSP implementation would use MPI, but starting the script via mpiexec -n 4 randExp.py doesn't work.
A minor thing, but Scientific Python != SciPy in your question...
If you download the ScientificPython sources you'll see a README.BSP, a README.MPI, and a README.BSPlib. Unfortunately, not much of that information is mentioned on the online web pages.
The README.BSP is pretty explicit about what you need to do to get the BSP stuff working in real Parallel:
In order to use the module Scientific.BSP using more than one real processor, you must compile either the BSPlib or the MPI interface. See README.BSPlib and README.MPI for installation details. The BSPlib interface is probably more efficient (I haven't done extensive tests yet), and allows the use of the BSP toolset; on the other hand, MPI is more widely available and might thus already be installed on your machine. For serious use, you should probably install both and make comparisons for your own applications. Application programs do not have to be modified to switch between MPI and BSPlib, only the method to run the program on a multiprocessor machine must be adapted.

To execute a program in parallel mode, use the mpipython or bsppython executable. The manual for your MPI or BSPlib installation will tell you how to define the number of processors.
and the README.MPI tells you what to do to get MPI support:
Here is what you have to do to get MPI support in Scientific Python:
1) Build and install Scientific Python as usual (i.e. "python setup.py install" in most cases).
2) Go to the directory Src/MPI.
3) Type "python compile.py".
4) Move the resulting executable "mpipython" to a directory on your system's execution path.
So you have to explicitly build more of the BSP machinery to take advantage of real parallelism. The good news is that you shouldn't have to change your program. The reason for this is that different systems have different parallel libraries installed, and libraries that sit on top of those need a configuration/build step like this to take advantage of whatever is available.
