I'm following the steps here in order to use all my computing power (10-core Intel i9 CPU) and solve a one-instance abstract Pyomo model. However, it seems that the solver is just using one CPU core and it takes more than 2 days to return a solution for a 50-node input (with 10 nodes it just takes seconds.) any help about making Pyomo model run in all available CPU cores?
Thanks
Thanks to #Erwin Kalvelagen for pointing out GLPK's serial nature, after doing some experiments I migrated to Gurobi and now all my 10 CPU cores are being used by the Pyomo model.
This may just be a syntax error on your end. Try using Arduino for C++. I9 processors do not support other ones like the Five.9 servers. Beware, overrides may occur to burn out your i9.
Related
I'm using Tensorflow 2.5 to train a starGAN network for generating images (128x128 jpeg). I am using tf.keras.preprocessing.image_dataset_from_directory to load the images from the subfolders.
Additionally I am using arguments to maximize loading performance as suggested in various posts and threads such as loadedDataset.cache().repeat.prefetch
I'm also using the num_parallel_calls=tf.data.AUTOTUNE for the mapping functions for post-processing the images after loading.
While training the network on GPU the performance I am getting for GPU Utilization is in the picture attached below.
My question regarding this are:
Is the GPU utlization normal or is it not supposed to be so erratic for traning GANs?
Is there any way to make this performance more consistent?
Is there any way to improve the training performance to fully utlize the GPU?
Note that Ive logged my disk I/O also and there is no bottleneck reading/writing from the disk (nvme ssd).
The system has 32GB RAM and a RTX3070 with 8GB Vram. I have tried the running it on colab also; but the performance was similarly erratic.
It is fairly normal for utilization to be erratic like for any kind of parallelized software, including training GANs. Of course, it would be better if you could fully utilize your GPU, but writing software that does this is challenging and becomes virtually impossible when you are talking about complex applications like GANs.
Let me try to demonstrate with a trivial example. Say you have two threads, threadA and threadB. threadA is running the following python code:
x = some_time_comsuming_task()
y = get_y_from_threadB()
print(x+y)
Here threadA is performing lots of calculations to get the value for x, retrieving the value for y, and printing out the sum of x+y. Imagine threadB is also doing some kind of time consuming calculation to generate the value for y. Unless threadA is ready to retrieve y at the exact same time threadB finishes calculating it, you won't have 100% utilization of both threads for the entire duration of the program. And this is just two threads, when you have 100s of threads working together with multiple chained data dependencies, you can see how it becomes exponentially more difficult to eliminate any and all time threads spend waiting on other threads to deliver input to the next step of computation.
Trying to make your "performance more consistent" is pointless. Whether your GPU utilization went up and down (like in the graph you shared) or it stayed exactly at the average utilization for the entire execution would not change the overall execution time, which is probably the actually important metric here. Utilization is mostly useful to identify where you can optimize your code.
Fully utilize? Probably not. As explained in my answer to question one, it's going to be virtually impossible to orchestrate your GAN to completely remove bottlenecks. I would encourage you to try and improve execution time, rather than utilization, when optimizing your GAN. There's no magic setting that you're missing that will completely unlock all of your GPU's potential.
I don't have a GPU on my machine, since most of the performance recommondations on tensorflow mention only GPU, can someone confirm that e.g.
tf.data.prefetch
tf.distribute.mirroredstrategy
tf.distribute.multiworkerstrategy
Will only work with multi GPU ?
I tried it on my PC and most of the functions realy slow down the process instead of increasing it. Therefore multi CPU is no benefit here?
In case you haven't solved your problem yet, you can use Google Colab (https://colab.research.google.com) to get a GPU - there you can change a runtime to GPU or TPU.
I did not understand exactly what you are asking but let me give you a 10,000ft explanation of those. It might help you to understand what/when you should use it.
tf.data.prefetch : let's suppose that your have 2 steps while training your model. a) read data, b) process the data. While you are processing the data, you could be reading more data to make sure it is available once 'training' is done with the current batch of data. Just think about a producer/consumer model. You don't want your consumer idle while you are producing more data.
tf.distribute.mirroredstrategy : this one helps if you have a single machine with more than one GPU. It allows to train a model in "parallel" on the same machine.
tf.distribute.multiworkerstrategy : let's suppose now that you have a cluster with 5 machines. You could train your model in a distributed fashion using all of them.
This is just a simple explanation of those 3 items you mentioned here.
Even after setting tf.config.threading.set_inter_op_parallelism_threads(1) and tf.config.threading.set_intra_op_parallelism_threads(1) Keras with Tensorflow CPU (running a simple CNN model fit) on a linux machine is creating too many threads. Whatever I try it seems to be creating 94 threads while going through the fitting epochs. Have tried playing with tf.compat.v1.ConfigProto settings but nothing helps. How do I limit the number of threads?
This is why tensorflow created many threads.
Using the mentioned 2 types of parallelism (inter and intra) you have limited control over the number of threads generated by TensorFlow. The minimum number of threads that you can get by setting these two variables is N, where N is the number of cores on your cpu (I don't know if you use gpu).
intra_op_parallelism_threads = 1
inter_op_parallelism_threads = 1
Even by setting the environment variables OMP_NUM_THREADS and MKL_NUM_THREADS can't help in further reducing the number of threads.
The following discussions suggest that without changing the source code of TensorFlow, it is not possible to reduce the number threads below N.
How can I confine TensorFlow C API to use one and only one thread in total
How to disable Tensorflow's multi-threading?
How to stop TensorFlow from multi-threading
https://github.com/tensorflow/tensorflow/issues/42510
https://github.com/tensorflow/tensorflow/issues/33627
I am running a simple deep learning model on Google's colab, but it's running slower than my MacBook Air with no GPU.
I read this question and found out it's a problem because of dataset importing over the internet, but I am unable to figure out how to speed up this process.
My model can be found here. Any idea of how I can make the epoch faster?
My local machine takes 0.5-0.6 seconds per epoch and google-colabs takes 3-4 seconds
Is GPU always faster than CPU? No, why? because the speed optimization by a GPU depends on a few factors,
How much part of your code runs/executes in parallel, i.e how much part of your code creates threads that run parallel, this is automatically taken care by Keras and should not be a problem in your scenario.
Time Spent sending the data between CPU and GPU, this is where many times people falter, it is assumed that GPU will always outperform CPU, but if data being passed is too small, the time it takes to perform the computation (No of computation steps required) are lesser than breaking the data/processes into thread, executing them in GPU and then recombining them back again on the CPU.
The second scenario looks probable in your case since you have used a batch_size of 5.
classifier=KerasClassifier(build_fn=build_classifier,epochs=100,batch_size=5), If your dataset is big enough, Increasing the batch_size will increase the performance of GPU over CPU.
Other than that you have used a fairly simple model and as #igrinis pointed out that data is loaded only once from drive to memory so the problem in all theory should not be loading time because the data is on drive.
I am reading this performance guide on the best practices for optimizing TensorFlow code for GPU. One suggestion they have is to place the preprocessing operations on the CPU so that the GPU is dedicated for training. To try to understand how one would actually implement this within an experiment (ie. learn_runner.run()). To further the discussion, I'd like to consider the best way to apply this strategy to the Custom Estimator Census Sample provided here.
The article suggests placing with tf.device('/cpu:0') around the preprocessing operations. However, when I look at the custom estimator the 'preprocessing' appears to be done in multiple steps:
Line 152/153 inputs = tf.feature_column.input_layer(features, transformed_columns) & label_values = tf.constant(LABELS) -- if I wrapped with tf.device('/cpu:0') around these two lines would that be sufficient to cover the 'preprocessing' in this example?
Line 282/294 - There is also a generate_input_fn and parse_csv function that are used to set up input data queues. Would it be necessary to place with tf.device('/cpu:0') within these functions as well or would that basically be forced by having the inputs & label_values already wrapped?
Main Question: Which of the above implementation suggestions is sufficient to properly place all preprocessing on the CPU?
Some additional questions that aren't addressed in the post:
What if the machine has multiple cores? Would 'cpu:0' be limiting?
The post implies to me that by wrapping the preprocessing on the cpu, the GPU would be automatically used for the rest. Is that actually the case?
Distributed ML Engine Experiment
As a follow up, I would like to understand how this can be further adapted in a distributed ML engine experiment - would any of the recommendations above need to change if there were say 2 worker GPUs, 1 master CPU and a parameter server? My understanding is that the distributed training would be data-parallel asynchronous training so that each worker will be independently iterating through the data (and passing gradients asynchronously back to the PS) which suggests to me that no further modifications from the single GPU above would be needed if you train in this way. However, this seems a bit to easy to be true.
MAIN QUESTION:
The 2 codes your placed actually are 2 different parts of the training, Line 282/294 in my options is so called "pre-processing" part, for it's parse raw input data into Tensors, this operations not suitable for GPU accelerating, so it will be sufficient if allocated on CPU.
Line 152/152 is part of the training model for it's processing the raw feature into different type of features.
'cpu:0' means the operations of this section will be allocated on CPU, but not bind to specified core. The operations allocated on CPU will run in multi-threads and use multi-cores.
If your running machine has GPUs, the TensorFlow will prefer allocating the operations on GPUs if the device is not specified.
The previous answer accurately describes device placement. Allow me to provide an answer to the questions about distributed TF.
The first thing to note is that, whenever possible, prefer a single machine with lots of GPUs to multiple machines with single GPUs. The bandwidth to parameters in RAM on the same machine (or even better, on the GPUs themselves) is orders of magnitude faster than going over the network.
That said, there are times where you'll want distributed training, including remote parameter servers. In that case, you would not necessarily need to change anything in your code from the single machine setup.