I currently have a Python script that I believe has a memory issue.
I reasoned it out last night: I believe some overhead is being constantly accumulated, but I don't know exactly what it is. I have tried gc.collect() and it didn't really help. The good thing about my code is that at each run I can save the parameters of the partially trained model as a pickle. So I can shut down the script, start a new Python process, and load the saved pickle to continue training.
My strategy is to do the following in a shell script:
for i in $(seq 1 1000); do
    # each run gives me a fresh environment with little memory overhead;
    # some overhead accumulates during the run, but train_model.py saves
    # its progress with torch.save(model.state_dict(), ...) and then exits,
    # releasing that memory
    python train_model.py
    # I am hoping that the next run of the script gets a fresh environment
done
Do you think this will work? How do I do this in a shell script?
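For the strategy to work, train_model.py itself has to pick up where the previous run left off. A minimal sketch of that resume logic, assuming a hypothetical checkpoint filename and using pickle as described in the question (torch.save/torch.load would follow the same pattern):

```python
import os
import pickle

# Hypothetical checkpoint filename; the question saves the parameters of
# the partially trained model as a pickle, so plain pickle stands in for
# torch.save here.
CHECKPOINT = "checkpoint.pkl"

def load_or_init_params():
    # Resume from the previous run if a checkpoint exists on disk.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    # Otherwise start fresh (stand-in for freshly initialized weights).
    return {"step": 0}

def save_params(params):
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(params, f)

params = load_or_init_params()
params["step"] += 1   # stand-in for one chunk of training
save_params(params)   # persist before the process exits
```

Each fresh process then starts with the state the previous process saved, while its own interpreter memory is released when it exits.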
I'm trying to run a python script in which part of the code is going to be parallelized according to some SLURM environment variables. I don't think the exact code is important, but for reference, I would like to use this to train my networks.
Now, the problem is that I need to run my script via srun; however, this will spawn multiple parallel instances of my script, which I don't want.
The most basic example would be this:
#!/bin/sh
#SBATCH -N 2
#SBATCH --ntasks=2
srun python myscript.py
Now I will have 2 nodes and 2 tasks, meaning that when I run python myscript.py there will be 2 instances of myscript.py running in parallel.
However, this is not what I want. I would like there to be only one instance of myscript.py running, however it should have access to the environment variables set by srun, and leave it to the python script to properly distribute the resources.
Setting srun --ntasks=1 does not work, since then the script will only 'see' one of the nodes.
Is it possible to use srun to run a single instance of the script while it still has 'access' to the SLURM environment variables? I've looked at options such as --exclusive and --preserve-env, but they do not seem to help me in this case.
It turns out that Hristo Iliev was right in the comments: to use the SlurmClusterResolver properly, multiple jobs need to run in parallel.
This can be a bit confusing as everything will be printed multiple times because everything is run in parallel, but this is normal.
However, my initial confusion, and my assumption that it had to be done as stated in the original question, came from TensorFlow reporting out-of-memory errors whenever I tried to use the MultiWorkerMirroredStrategy, whereas I knew that without it the model fit comfortably within the available memory.
Somewhere in my code I made a call to tf.config.get_visible_devices("GPU"). For TensorFlow to list the GPUs it initializes them, and by default it does so by allocating the complete GPU memory. Since all scripts run in parallel, each script tried to do this for itself (this happened outside the scope of the strategy), resulting in out-of-memory (OOM) errors.
After removing this piece of code, everything ran fine.
Suggestion for people that might stumble upon this post in the future:
- Scripts are supposed to run in parallel; you will see the same outputs multiple times
- Make sure that everything is done under strategy.scope(), e.g. model compiling and data generation setup (using tf.data)
- Pay special attention to saving the model: only the 'main' worker should save the model to the real save file; the others should write to temporary files
- If you get out-of-memory errors, make sure there is no piece of code that allocates all of the GPU memory outside the scope. This can be some initialization somewhere by TensorFlow, and since it is present in all scripts it will cause OOM errors. A handy way to test for this is tf.config.experimental.set_memory_growth, which lets memory grow on demand instead of allocating all of it up front.
In my code, I used the get_task_info() function of tf.distribute.cluster_resolver.SlurmClusterResolver, and only ran functions that allocate memory when the task number was 0, the main worker.
(Above functions and comments are based on TensorFlow 2.2.0 and Python 3.7.7)
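Based on the points above, a rough sketch of the main-worker check; is_main_worker is a hypothetical helper, and the TensorFlow calls are left commented out because they require TensorFlow (around 2.2) plus an actual SLURM allocation:

```python
def is_main_worker(task_id):
    # Only task 0 should run code that must execute exactly once across
    # the job, such as saving the model to the real save path.
    return task_id == 0

# Inside the real training script (requires TensorFlow and SLURM):
# import tensorflow as tf
# resolver = tf.distribute.cluster_resolver.SlurmClusterResolver()
# task_type, task_id = resolver.get_task_info()
# # allow memory to grow instead of grabbing all GPU memory up front:
# for gpu in tf.config.experimental.list_physical_devices("GPU"):
#     tf.config.experimental.set_memory_growth(gpu, True)
# if is_main_worker(task_id):
#     ...  # e.g. save the model to the real save file
```

The same predicate can guard any other one-off work, such as logging or writing summary files.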
I am trying to find ways to make better use of my time while programming.
I have a Python script that does some heavy work (it can take hours) to finish. Now, most of the work it does is network-related, so I have plenty of CPU resources to spare.
If the script were a C binary executable, it would be fine to git checkout a different branch and do extra work; I could even modify the binary on disk, as it has been copied to RAM, so until it finishes running I won't affect the program's output.
But Python scripts are interpreted, not compiled ahead of time. What happens if I start tampering with the source file: can I corrupt the program's output, or are the text file and associated imports copied to RAM, allowing me to tamper with the source with no risk of changing the behaviour of the running program?
In general, if you have a single Python file which you run as a script, you're fine. When you run the file, it is compiled into bytecode which is then executed. You can change the original script at this point and nothing breaks.
However, we can deliberately break it by writing some horrible but legal code like this:
horrible.py:
from time import sleep
sleep(10)
import silly
silly.thing()
silly.py:
def thing():
    print("Wow!")
You can run horrible.py and while it is running you can edit silly.py on disk to make it do something else. When silly.py is finally imported, the updated version will be loaded.
A workaround is to put all your imports at the top of the file, which you probably do anyway.
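The hazard above can be reproduced end-to-end in a single self-contained script; the temporary directory and module are created on the fly, so nothing here depends on the original files:

```python
import os
import sys
import tempfile

tmpdir = tempfile.mkdtemp()
module_path = os.path.join(tmpdir, "silly.py")

# Write the first version of silly.py.
with open(module_path, "w") as f:
    f.write("def thing():\n    return 'Wow!'\n")

sys.path.insert(0, tmpdir)

# The main script is already running, but has not imported silly yet.
# Someone now edits silly.py on disk:
with open(module_path, "w") as f:
    f.write("def thing():\n    return 'edited on disk'\n")

# The import happens only now, so the edited version is what gets loaded.
import silly
print(silly.thing())  # prints "edited on disk"
```

Once a module has been imported it lives in sys.modules, and further edits to the file no longer affect the running process, which is why moving imports to the top closes the window.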
When a Python program is run, it is compiled to bytecode (cached in a .pyc file) that the Python interpreter then executes. Changing the source file should NOT affect code that is already running.
Here is a related Stack Overflow answer: "What will happen if I modify a Python script while it's running?"
Why not have another working directory where you make your modifications? Is there a lot of ancillary data or something that makes it hard to set up a second working directory? That is, if your working directory is A, git clone A B and then work in B. When you're done, you can pull the changes back from B into A:
git remote add B ../B
git pull B master
I am running a Python 2.7 script in Spyder that can run as long as I need, but I am getting a MemoryError. Short runs of this script do not trigger it, and I have no reason to believe that the work changes substantially as the runtime grows arbitrarily long. I'd like to go into Spyder's settings to allow more memory to be consumed, but I cannot find the relevant setting.
Forgive me if it's a silly question; I'm new to Python and scripting languages. I'm using Komodo Edit to write and run Python programs. Each time I run one, I have to wait until the program finishes execution before I can see my "print" results from the middle of the run. I'm wondering if it's possible to see real-time output, as in a console. Maybe it is caused by some preference in Komodo?
Another question: I know that in the interpreter, when I store some variables, it remembers what I stored, like a MATLAB workspace. But in Komodo Edit the program runs from the beginning every time and keeps no temporary variables for debugging. For example, if I need to read in some large file and do some operations on it, I have to read it in again on every run, which takes a lot of time.
Is there a way to achieve instant output or temporary variable storage without typing every line into the interpreter directly, when using other environments like Komodo?
The Python output is realtime.
If your output is not realtime, this is likely an artefact of Komodo Edit. Run your script outside of Komodo.
And Python, like any programming language, starts from scratch when you start it. How else would it work?
If you want an interpreter-like situation, you can use import pdb; pdb.set_trace() in your script. That will give you an interpreter prompt for debugging.
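A minimal sketch of that suggestion; the function and data are made up for illustration, and the set_trace() line is commented out so the snippet runs non-interactively:

```python
def summarize(values):
    total = sum(values)
    # import pdb; pdb.set_trace()  # uncomment to get a prompt here and inspect `total`
    return total

print(summarize([1, 2, 3]))  # prints 6
```

(On Python 3.7+, the built-in breakpoint() is equivalent to the import-and-set_trace idiom.)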
I've looked at some questions about profiling memory usage in Python programs, but so far haven't been able to get anything to work. My program must run as root (it opens a TUN/TAP device).
First, I tried Heapy; unfortunately it didn't work for me: every time my code tried to execute hpy().heap(), the program froze. Not wanting to waste too much time, I decided to try Valgrind.
I tried valgrind with massif:
# valgrind --tool=massif ./my_prog.py --some-options value
I think the issue is related to profiling Python programs. I tried my program (which runs as root) and no massif output file was generated. I also wasn't able to generate an output file with another Python program (which doesn't run as root). However, a simple C test program worked fine and produced the massif file.
What are the issues preventing Valgrind and massif from working correctly with Python programs?
Instead of having the script launch the interpreter via its shebang line, passing the interpreter directly to Valgrind solves the problem:

valgrind --tool=massif python my_script.py

Massif then writes its results to massif.out.<pid>, which can be rendered with ms_print.