I'm following this tutorial to perform time series classification using Transformers with Keras and TensorFlow. I'm using Windows 10 and the PyDev Eclipse plugin. Unfortunately, my program stops and the console output is completely blank every time I run the following code:
n_classes = len(np.unique(y_train))
input_shape = np.array(x_trainScaled).shape[0:]
model = build_model(n_classes, input_shape, head_size=256, num_heads=4, ff_dim=4,
                    num_transformer_blocks=4, mlp_units=[128], mlp_dropout=0.4, dropout=0.25)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              metrics=["sparse_categorical_accuracy"])
print(model.summary())
callbacks = [keras.callbacks.EarlyStopping(patience=100, restore_best_weights=True)]
model.fit(x_trainScaled, y_train, validation_split=0.2, epochs=200, batch_size=64, callbacks=callbacks)
pathToModel = 'my/path/to/model/'
model.save(pathToModel)
Even earlier warnings and print statements are erased from the console, and I have no idea what's going on. If I comment out the model.fit(...) statement, the program terminates, crashing with an error message from a subsequent model.predict(...) call.
Any help is highly appreciated.
The solution was to convert the input data and labels to NumPy arrays first. Thus, calling the fit function as follows:
model.fit(np.array(x_trainScaled), np.array(y_train), validation_split=0.2, epochs=200, batch_size=64, callbacks=callbacks)
worked perfectly fine for me, as opposed to:
model.fit(x_trainScaled, y_train, validation_split=0.2, epochs=200, batch_size=64, callbacks=callbacks)
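The same conversion presumably applies anywhere data is fed to the model, including the model.predict(...) call mentioned above. A minimal sketch (x_testScaled is a hypothetical hold-out set, not from the original code):

x_train_arr = np.array(x_trainScaled)
y_train_arr = np.array(y_train)
model.fit(x_train_arr, y_train_arr, validation_split=0.2, epochs=200, batch_size=64, callbacks=callbacks)
preds = model.predict(np.array(x_testScaled))  # hypothetical test set, converted the same way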
I am using the DDPGAgent class from the keras-rl library for Pendulum. The code is almost the same as in the example: https://github.com/keras-rl/keras-rl/blob/master/examples/ddpg_pendulum.py.
Everything is OK: after fit, if I call agent.test, I can see the pendulum staying upright.
But if I try to load the model again later, using load_weights:
agent = DDPGAgent(...)
agent.compile(Adam(learning_rate=.001, clipnorm=1.), metrics=['mae'])
agent.load_weights('model.hdf5')
agent.test(env, nb_episodes=5, visualize=False)
The results are very bad, with rewards smaller than -1000.
The same problem is also reported in https://github.com/keras-rl/keras-rl/issues/300. But no solution is provided.
So what did I do wrong with load_weights?
Thank you in advance!
Xinyu
UPDATE:
I found the problem: a processor (inheriting from WhiteningNormalizerProcessor) is passed to DDPGAgent to normalize the input data. If we remove the processor (i.e. set processor=None in the DDPGAgent constructor), or override the method process_state_batch so that it returns the batch unchanged, then everything works just fine. A sketch of the second option is below.
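A minimal sketch of such a pass-through processor, assuming keras-rl's rl.processors module (the class name is mine):

from rl.processors import WhiteningNormalizerProcessor

class PassThroughProcessor(WhiteningNormalizerProcessor):
    def process_state_batch(self, batch):
        # skip the whitening normalization and hand the batch through untouched
        return batch

# agent = DDPGAgent(..., processor=PassThroughProcessor())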
The question now is: shouldn't the inputs of Pendulum-v0 be normalized? Why does load_weights not work if I normalize the inputs?
I was implementing a training loop in VS Code. I created an Adam optimizer for an XLM-RoBERTa model as follows:
xlm_r_model = XLMRobertaForSequenceClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=NUM_LABELS,
    output_attentions=False,
    output_hidden_states=False,
)
xlm_r_model.to(device)
optimizer = torch.optim.Adam(xlm_r_model.parameters(), lr=LR)
Then at the following line:
optimizer.step()
VS Code simply terminates the execution, without any error stack trace.
So I debugged to find out exactly where this happens. I reached the line that makes the F.adam(...) call.
Weirdly, on GitHub, torch.optim.adam does not have this line; the closest matching line seems to be line 150.
The call then goes into torch.optim._functional.adam.
There, the params list (line 72) in the for loop contains 201 elements, and I am unable to figure out exactly which param is going wrong. When I let it continue running, it doesn't pause in debug mode when the error occurs; instead VS Code simply terminates.
Again, I am not able to find this function in GitHub's _functional version.
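One way to narrow down which of the 201 params misbehaves (my own suggestion, not from the original post) is to scan the gradients for non-finite values right before optimizer.step(), using only stock PyTorch calls:

# torch.autograd.set_detect_anomaly(True)  # optionally, locate the op that produced NaNs/Infs
for name, param in xlm_r_model.named_parameters():
    if param.grad is not None and not torch.isfinite(param.grad).all():
        print(f"non-finite gradient in: {name}")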
When I checked several Kaggle notebooks (1, 2, 3, 4) for training XLM-RoBERTa, they use AdamW and the torch_xla package to train on TPUs, something like this:
import torch_xla.core.xla_model as xm

optimizer = AdamW([{'params': model.roberta.parameters(), 'lr': LR},
                   {'params': [param for name, param in model.named_parameters() if 'roberta' not in name], 'lr': 1e-3}],
                  lr=LR, weight_decay=0)
xm.optimizer_step(optimizer)
Am I missing some context, and is it indeed compulsory to train using AdamW or torch_xla? Or am I making some stupid mistake?
PS:
I am running this on Colab. Its pip shows torch version 1.10.0+cu111 and Python 3.7.13. I have run code-server on Colab through colabcode and am debugging in browser-based VS Code.
I was able to train Bert with Adam optimizer earlier.
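For what it's worth, AdamW by itself does not require torch_xla: torch.optim ships its own AdamW, and xm.optimizer_step is only needed on TPUs. A sketch of a plain GPU/CPU variant (my substitution, not from the notebooks; the weight_decay value is an arbitrary example):

optimizer = torch.optim.AdamW(xlm_r_model.parameters(), lr=LR, weight_decay=0.01)
loss.backward()
optimizer.step()  # instead of xm.optimizer_step(optimizer), which is TPU-specific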
I have searched a lot for this error, on Stack Overflow and other websites, but I cannot seem to find a solution to my problem.
Basically, I have a program written in Python, and I am using the rpy2 module to call some R functions from Python.
The problem is that when I run the code I sometimes, but not always, encounter this error. I am on Windows. Sometimes restarting my PC lets the code get further, but eventually the error pops up again. What should I do?
I have Python 3.6.7, with PyCharm 2018.3.3. However, I doubt the problem comes from PyCharm, because when I run my program from cmd the same thing happens, except that the program halts directly without showing the message "Process finished with exit code -1073741819 (0xC0000005)". That message only appears in PyCharm, but still.
I have rpy2 version 2.9.5
Code description
I know roughly which part of the code is doing this, but I cannot optimize it further. In this part of the code, inside cross-validation, I am over-populating each of the train and validation sets in a certain way: I combine X_train and y_train back into one data frame, over-populate that data frame, then split it back into the updated, over-populated X_train and y_train and perform my analysis on those. I think combining the two numpy arrays into a pandas dataframe and then un-combining them again is creating this memory error. It is also important to note that this happens in every fold, and I am doing 10-fold, 10-repeat cross-validation. However, the same thing happens even when I run this on a desktop PC rather than on my laptop, and my laptop still has plenty of GB free. I am doubting this is a python/rpy2 error?
Code snippet
# I am calling this function inside each fold
df_combined = self.prepare_data(X_train, y_train)
and then after calling prepare_data() I do as follows:
# THE FUNCTIONS apply_f1(), apply_f2(), apply_f3(), AND apply_f4()
# USE rpy2 INTERNALLY
if self.f1:
    X_train_inner, y_train_inner = self.apply_f1(df_combined)
elif self.f2:
    X_train_inner, y_train_inner = self.apply_f2(df_combined)
elif self.f3:
    X_train_inner, y_train_inner = self.apply_f3(df_combined)
else:
    X_train_inner, y_train_inner = self.apply_f4(df_combined)
The prepare_data() function:
def prepare_data(self, X_train, y_train):
    '''
    Concatenates X_train and y_train into one numpy array and turns it into a data frame,
    so the data frame can be processed by SMOGN, RandUnder, GN, or SMOTER.
    '''
    # reshape + rename
    X_train_samp = X_train
    y_train_samp = y_train.reshape(-1, 1)
    # combine the two numpy arrays into one
    combined = np.concatenate((X_train_samp, y_train_samp), axis=1)
    # turn X_train + y_train into a pandas dataframe
    column_names = self.other + [self.target_variable]
    df_combined = pd.DataFrame(combined, columns=column_names)
    # convert the combined pandas dataframe to an R data.frame
    df_combined = pandas2ri.py2ri(df_combined)
    return df_combined
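As an aside, pandas2ri.py2ri is the rpy2 2.x API; if upgrading rpy2 is an option, the 3.x series replaced it with an explicit converter context. A sketch of the equivalent conversion, assuming rpy2 >= 3.0:

import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

# convert the pandas dataframe to an R data.frame inside a conversion context
with localconverter(ro.default_converter + pandas2ri.converter):
    r_df = ro.conversion.py2rpy(df_combined)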
I have had this same error message "Process finished with exit code -1073741819 (0xC0000005)" with PyCharm 2021.1.
It happened because I had selected Python 3.9 as the interpreter while PyCharm was actually trying to use Python 3.10, and in fact I only had Python 3.8 installed.
In my case, the error disappeared after I selected Python 3.8 as the interpreter.
First of all, I tried to perform dimensionality reduction on my n_samples x 53 data using scikit-learn's Kernel PCA with a precomputed kernel. The code worked without any issues when I first tried it with 50 samples. However, when I increased the number of samples to 100, I suddenly got the following message:
Process finished with exit code -1073740940 (0xC0000374)
Here are the details of what I want to do:
I want to obtain the optimal value of the kernel function hyperparameter in my Kernel PCA function, defined as follows:
from sklearn.decomposition import KernelPCA as drm
from somewhere import costfunction
from somewhere_else import customkernel

def kpcafun(w, X):
    # X is the sample matrix, w is the kernel hyperparameter
    n_princomp = 2
    drmodel = drm(n_princomp, kernel='precomputed')
    k_matrix = customkernel(X, X, w)
    transformed_x = drmodel.fit_transform(k_matrix)
    cost = costfunction(transformed_x)
    return cost
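For anyone reproducing this: customkernel is imported from somewhere_else and not shown, so here is a hypothetical stand-in, a plain Gaussian kernel with width hyperparameter w:

import numpy as np

def customkernel(X1, X2, w):
    # hypothetical stand-in, not the original kernel: k(x, y) = exp(-w * ||x - y||^2)
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-w * sq_dists)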
Therefore, to optimize the hyperparameter I used the following code:
from scipy.optimize import minimize

# assume that wstart and optimbound are already defined;
# note that args must be a tuple: args=(X) is just X, so use args=(X,)
res = minimize(kpcafun, wstart, method='L-BFGS-B', bounds=optimbound, args=(X,))
The strange thing is that when I stepped through the first 10 iterations of the optimization in the debugger, nothing strange happened; all the variable values looked normal. But when I turned off the breakpoints and let the program run on, the message appeared without any error notification.
Does anyone know what might be wrong with my code? Or does anyone have tips for resolving a problem like this?
Thanks
I'm after advice on how to debug what TensorFlow is struggling with when it hangs.
I have a multi-layer CNN which hangs when global_variables_initializer() is run in the session. I get no errors or messages on the console output.
Is there an intelligent way of debugging what TensorFlow is struggling with when it hangs, instead of repeatedly commenting out the lines of code that build the graph and re-running to see where it hangs? Would the TensorFlow debugger (tfdbg) help? What options do I have?
Ideally it would be great to just break the current execution and look at a stack trace or similar to see where execution is hanging during the init.
I'm currently running TensorFlow 0.12.1 with Python 3 inside a Jupyter notebook.
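From the docs, hooking tfdbg into a session looks roughly like this (a sketch; I haven't confirmed the exact import on 0.12):

from tensorflow.python import debug as tf_debug

sess = tf.Session()
sess = tf_debug.LocalCLIDebugWrapperSession(sess)  # drops into the tfdbg CLI on each run()
sess.run(tf.global_variables_initializer())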
I managed to solve the problem. The tip from #amo-ej1 to run the code as a regular file was a step in the correct direction. It uncovered that the TensorFlow process was killing itself off with a SIGKILL and returning an error code of 137 (128 + 9, i.e. SIGKILL).
I tried the TensorFlow Debugger (tfdbg), but it did not provide any further details, as the problem was that the graph did not initialize. I started to suspect the graph structure was incorrect, so I dumped it out using:
tf.summary.FileWriter('./logs/traing_graph', graph)
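and then launched TensorBoard against that directory (standard usage, shown for completeness):

tensorboard --logdir ./logs/traing_graph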
Inspecting the resulting graph structure in TensorBoard, I found that the tensor dimensions of the fully connected layer were wrong, with a width of 15 million (!).
It turned out that one of the configurable parameters of the graph was incorrect: the width of the layer-2 tensor was being read from the wrong tf.shape-style property of the previous layer, which exploded the dimensions of the graph.
There were no OOM error messages in /var/log/system.log, so I am unsure why graph initialisation caused the python tensorflow script process to die.
I fixed the dimensions of the graph and graph initialization worked just fine!
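A hypothetical illustration of that kind of mix-up (the real code isn't shown here): picking a dynamic shape element instead of the intended static width:

layer2_width = layer2.get_shape()[1].value  # intended: the static width of layer 2
# the kind of wrong pick described above: a dynamic dimension via tf.shape
# layer2_width = tf.shape(layer2)[0]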
My top tip: visualise your graph with TensorBoard before initialisation and training, to quickly check that the graph structure you coded is what you expected it to be. You will probably save yourself a lot of time! :-)
A common methodology for debugging TensorFlow is to replace the placeholders and/or variables with numpy arrays wrapped in tf.constant. When you do so, you can examine the logic of your code by setting breakpoints and seeing actual numbers in a "pythonic" way, not just tensors. It would be much easier to help you if you posted your code here, but here is a dummy example:
with tf.name_scope('scope_name'):
    ### This block is for debug only
    import numpy as np
    batch_size = 20
    sess = tf.Session()
    sess.run(tf.tables_initializer())
    init_op = tf.global_variables_initializer()
    sess.run(init_op)
    ### End of first debug block

    ## Replacing placeholders for debug - uncomment the placeholders and comment out the numpy arrays for production mode
    const_a = tf.constant((np.random.rand(batch_size, 26) > 0.85).astype(int), dtype=tf.float32)
    const_b = tf.constant(np.random.randint(0, 20, batch_size * 26).reshape((batch_size, 26)), dtype=tf.float32)
    # real_a_placeholder = tf.log(input_placeholder_dict[A_DATA])
    # real_b_placeholder = tf.log(input_placeholder_dict[B_DATA])

    # dummy operation
    c = const_a - const_b

    # selecting top k - in the sanity check you can see that you actually get the top items and top values
    top_k = 5
    top_k_values, top_k_indices = tf.nn.top_k(c, k=top_k, sorted=True, name="top_k")
    ## Replacing variables for debug - uncomment the variables and comment out the numpy arrays for production mode
Now, run your code with breakpoints, and you have two options to see the values in the debugger:
1. sess.run(placeholder_name)
2. using eval - variable_name.eval(session=sess)
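For example, with the names from the dummy block above:

print(sess.run(top_k_values))
print(top_k_values.eval(session=sess))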