I am using:
keras-rl2 : 1.0.4
tensorflow : 2.4.1
numpy : 1.19.5
gym : 0.18.0
to train a DQN model for a reinforcement learning project.
My action space contains 60 discrete values:
self.action_space = Discrete(60)
and I am getting this error after x steps:
1901/10000 [====>.........................] - ETA: 1:02 - reward: 6.1348
Traceback (most recent call last):
File "D:/GitHub/networth-simulation/rebalancing_simple_discrete.py", line 203, in <module>
dqn.fit(env, nb_steps=5000, visualize=False, verbose=1)
File "D:\GitHub\networth-simulation\venv\lib\site-packages\rl\core.py", line 169, in fit
action = self.forward(observation)
File "D:\GitHub\networth-simulation\venv\lib\site-packages\rl\agents\dqn.py", line 227, in forward
action = self.policy.select_action(q_values=q_values)
File "D:\GitHub\networth-simulation\venv\lib\site-packages\rl\policy.py", line 227, in select_action
action = np.random.choice(range(nb_actions), p=probs)
File "mtrand.pyx", line 928, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN
When I use a lower number of discrete actions (<10) it doesn't happen every time.
I found that fix, but I don't understand how to apply it: I can't find any file named "numpy/random/mtrand/mtrand.pyx".
Has anyone found a way to fix that error?
This error apparently occurs when the values of the environment class attributes become too large in magnitude (positive or negative).
I solved the problem by limiting those values in my step() method.
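For anyone wondering what such a limitation could look like, here is a minimal sketch. The bounds and the helper name bound_step_output are hypothetical and not taken from my actual environment; the idea is simply to clip whatever step() returns and to fail early if a NaN or inf still slips through.

```python
import numpy as np

# Hypothetical bounds; pick values that make sense for your own environment.
STATE_LOW, STATE_HIGH = -1e3, 1e3
REWARD_LOW, REWARD_HIGH = -10.0, 10.0


def bound_step_output(state, reward):
    """Clip the state and reward so neither can grow without limit."""
    state = np.clip(np.asarray(state, dtype=np.float32), STATE_LOW, STATE_HIGH)
    reward = float(np.clip(reward, REWARD_LOW, REWARD_HIGH))
    # np.clip lets NaN pass through, so check for it explicitly.
    if not np.all(np.isfinite(state)):
        raise ValueError("state contains NaN or inf")
    return state, reward


# Example: values that have drifted far out of range are pulled back.
state, reward = bound_step_output([5.0, 2.5e9, -7.0e12], 1.0e6)
print(state, reward)
```

Calling this helper just before the return statement of step() keeps the inputs of the Q-network in a bounded range, which is what made the NaN probabilities disappear in my case.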
I need some clear instructions on how to execute some code.
Context:
This is a Python machine learning script for peptide binding, but you don't need to know biology to help me.
I am trying to reproduce a scientific paper to test its validity and see whether I can use it. I work in the biotech industry and am only somewhat familiar with C# and Python.
The paper links to a GitHub page, which has some instructions on how to run the code. But every time I run the code as instructed, it gives me an error. I have already installed its requirements (the latest versions of pytorch, numpy and scikit-learn) and I have tried both GPU and CPU, but neither worked. I don't know what to do at this point.
Paper Title:
"Prediction of Specific TCR-Peptide Binding From Large Dictionaries of TCR-Peptide Pairs" by Ido Springer, Hanan Besser, et al.
Paper's GitHub (found in the paper's abstract):
https://github.com/louzounlab/ERGO
These are the example commands I entered in the terminal. They come from a comment at the end of ERGO.py.
GPU ver:
python ERGO.py train lstm mcpas specific cuda:0 --model_file=model.pt --train_data_file=train_data --test_data_file=test_data
GPU code results:
Traceback (most recent call last):
File "D:\D Download\ERGO-master\ERGO.py", line 437, in <module>
main(args)
File "D:\D Download\ERGO-master\ERGO.py", line 141, in main
model, best_auc, best_roc = lstm.train_model(train_batches, test_batches, args.device, arg, params)
File "D:\D Download\ERGO-master\lstm_utils.py", line 163, in train_model
model.to(device)
File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 927, in to
return self._apply(convert)
File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 579, in _apply
module._apply(fn)
File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 602, in _apply
param_applied = fn(param)
File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 925, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\cuda\__init__.py", line 211, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
CPU code ver (only replaced specific cuda:0 with specific cpu):
python ERGO.py train lstm mcpas specific cpu --model_file=model.pt --train_data_file=train_data --test_data_file=test_data
CPU code results:
epoch: 1
C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\functional.py:1960: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
Traceback (most recent call last):
File "D:\D Download\ERGO-master\ERGO.py", line 437, in <module>
main(args)
File "D:\D Download\ERGO-master\ERGO.py", line 141, in main
model, best_auc, best_roc = lstm.train_model(train_batches, test_batches, args.device, arg, params)
File "D:\D Download\ERGO-master\lstm_utils.py", line 173, in train_model
loss = train_epoch(batches, model, loss_function, optimizer, device)
File "D:\D Download\ERGO-master\lstm_utils.py", line 137, in train_epoch
loss = loss_function(probs, batch_signs)
File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\loss.py", line 613, in forward
return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\functional.py", line 3074, in binary_cross_entropy
raise ValueError(
ValueError: Using a target size (torch.Size([50])) that is different to the input size (torch.Size([50, 1])) is deprecated. Please ensure they have the same size.
Looking at the ValueError, it seems that what you're trying to do is deprecated in pytorch, so you probably have a more recent version of the package than the one the code was developed with. I suggest you try
pip install torch==1.4.0
on the command line (the package is named torch on PyPI). Note that such an old release may not be installable on Python 3.10.
I'm not familiar with pytorch, but managing tensor shapes in tensorflow is the biggest pain in the a** for me. The actual problem appears to be that the input has one more dimension than the target (torch.Size([50, 1]) versus torch.Size([50])), so you would have to reshape one of them manually.
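If you would rather keep your current pytorch version, the reshape could look something like the sketch below. It is only an illustration on random tensors: the names probs and batch_signs are taken from the traceback, and I am assuming they have the shapes reported in the error message. In ERGO itself the change would go just before the loss_function(probs, batch_signs) call in lstm_utils.py.

```python
import torch
import torch.nn.functional as F

# Stand-ins with the shapes from the error message.
probs = torch.rand(50, 1)                          # model output, shape (50, 1)
batch_signs = torch.randint(0, 2, (50,)).float()   # labels, shape (50,)

# Option 1: give the target a trailing dimension so it becomes (50, 1).
loss_a = F.binary_cross_entropy(probs, batch_signs.unsqueeze(1))

# Option 2: drop the trailing dimension of the input so it becomes (50,).
loss_b = F.binary_cross_entropy(probs.squeeze(1), batch_signs)

print(loss_a.item(), loss_b.item())
```

Both options compare each probability with its own label, so they give the same loss; which one you pick only depends on which shape the rest of the code expects.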
I encountered this error while running simpletransformers on Google Colab. I enabled the GPU hardware accelerator and ran the code.
from simpletransformers.classification import ClassificationModel
# Create a ClassificationModel
model = ClassificationModel("bert", "sagorsarker/bangla-bert-base", num_labels=2, use_cuda=True, args={
    'reprocess_input_data': True,
    'use_cached_eval_features': False,
    'overwrite_output_dir': True,
    'num_train_epochs': 3,
    'silent': True
})
# Train the model
model.train_model(train)
Full Error -
Some weights of the model checkpoint at sagorsarker/bangla-bert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sagorsarker/bangla-bert-base and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/usr/local/lib/python3.7/dist-packages/simpletransformers/classification/classification_model.py:449: UserWarning: Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels.
"Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."
Exception in thread Thread-13:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.7/multiprocessing/pool.py", line 431, in _handle_tasks
put(task)
File "/usr/lib/python3.7/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/usr/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
pyo3_runtime.PanicException: no entry found for key
And it freezes at this moment. Nothing happens despite the cell running. What should I do to make it work?
Edit:
I solved it for the time being using simpletransformers version 0.50.0
I was trying to use the hungry-geese gym [here](https://www.kaggle.com/victordelafuente/dqn-goose-with-stable-baselines3-pytorch#) to train PPO:
from kaggle_environments import make
from stable_baselines3 import PPO
directions = {0: 'EAST', 1: 'NORTH', 2: 'WEST', 3: 'SOUTH'}
loaded_model = PPO.load('logs\\dqn2ppo_nonvec\\model')
def agent_ppo(obs, config):
    a = directions[loaded_model.predict(obs)[0]]
    return a
env = make('hungry_geese', debug=True)
env.run([agent_ppo, 'agent_bfs.py'])
env.render(mode="ipython")
But my game was played for only one step. After running with debug on, I got the following trace:
Traceback (most recent call last):
File "c:\users\crrma\.virtualenvs\hungry_geese-ept5y6nv\lib\site-packages\kaggle_environments\agent.py", line 151, in act
action = self.agent(*args)
File "<ipython-input-29-faad97d317d6>", line 5, in agent_ppo
a = directions[loaded_model.predict(obs)[0]]
File "c:\users\crrma\.virtualenvs\hungry_geese-ept5y6nv\lib\site-packages\stable_baselines3\common\base_class.py", line 497, in predict
return self.policy.predict(observation, state, mask, deterministic)
File "c:\users\crrma\.virtualenvs\hungry_geese-ept5y6nv\lib\site-packages\stable_baselines3\common\policies.py", line 262, in predict
observation = ObsDictWrapper.convert_dict(observation)
File "c:\users\crrma\.virtualenvs\hungry_geese-ept5y6nv\lib\site-packages\stable_baselines3\common\vec_env\obs_dict_wrapper.py", line 68, in convert_dict
return np.concatenate([observation_dict[observation_key], observation_dict[goal_key]], axis=-1)
KeyError: 'observation'
So I debugged a bit more in VS Code and stepped into the call above. Neither the observation key nor the desired_goal key is present in observation_dict.
Am I using the API incorrectly (I am new to it)? Or could this be a bug, which I think is highly unlikely?
Colab notebook and model
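Judging from the trace, predict() receives the raw observation, which is dict-like, and therefore goes down the ObsDictWrapper path that expects goal-environment keys. One possible direction is to turn the observation into a plain array before calling predict(). The sketch below is only an assumption about how that could look: obs_to_array and its encoding (1 for my goose, -1 for other geese, 0.5 for food on a 7 x 11 board) are hypothetical, and whatever encoding is used has to match the one the model was trained with.

```python
import numpy as np

ROWS, COLUMNS = 7, 11  # assumed default hungry_geese board size


def obs_to_array(obs):
    """Flatten a hungry_geese observation dict into a float32 vector.

    Hypothetical encoding: 1 = own goose, -1 = other geese, 0.5 = food.
    """
    board = np.zeros(ROWS * COLUMNS, dtype=np.float32)
    for i, goose in enumerate(obs["geese"]):
        cells = np.asarray(goose, dtype=int)  # board positions of this goose
        board[cells] = 1.0 if i == obs["index"] else -1.0
    board[np.asarray(obs["food"], dtype=int)] = 0.5
    return board


# A hand-written observation in the format the environment passes to agents.
sample_obs = {"index": 0, "geese": [[0, 1], [20], [], []], "food": [5, 76]}
vec = obs_to_array(sample_obs)
print(vec.shape)
```

Inside agent_ppo the call would then become loaded_model.predict(obs_to_array(obs)) instead of loaded_model.predict(obs), again assuming the model was trained on this kind of flat vector.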
I want to train a Faster R-CNN with ChainerCV. As a first test I mostly copied the provided example and only changed the lines that select the dataset so that it uses my custom dataset. I checked that my dataset is fully functional with all the operations described in this tutorial.
If I run the script without further changes everything works perfectly, but if I change the batch_size I get an error. I tried increasing the shared_mem from 100 MB to 1000 MB, but the error didn't disappear.
Error when setting the batch_size=2:
Exception in main training loop: all the input array dimensions except for the concatenation axis must match exactly
Traceback (most recent call last):
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/chainer/training/trainer.py", line 315, in run
update()
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
self.update_core()
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/chainer/training/updaters/standard_updater.py", line 171, in update_core
in_arrays = self.converter(batch, self.device)
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/chainer/dataset/convert.py", line 134, in concat_examples
[example[i] for example in batch], padding[i])))
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/chainer/dataset/convert.py", line 164, in _concat_arrays
return xp.concatenate([array[None] for array in arrays])
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "/home/cv/ChainerCV/faster_rcnn/train.py", line 131, in <module>
main()
File "/home/cv/ChainerCV/faster_rcnn/train.py", line 126, in main
trainer.run()
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/chainer/training/trainer.py", line 329, in run
six.reraise(*sys.exc_info())
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/chainer/training/trainer.py", line 315, in run
update()
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
self.update_core()
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/chainer/training/updaters/standard_updater.py", line 171, in update_core
in_arrays = self.converter(batch, self.device)
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/chainer/dataset/convert.py", line 134, in concat_examples
[example[i] for example in batch], padding[i])))
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/chainer/dataset/convert.py", line 164, in _concat_arrays
return xp.concatenate([array[None] for array in arrays])
ValueError: all the input array dimensions except for the concatenation axis must match exactly
System info:
__Hardware Information__
Machine : x86_64
CPU Name : skylake
Number of accessible CPU cores : 8
__OS Information__
Platform : Linux-4.15.0-45-generic-x86_64-with-debian-stretch-sid
Release : 4.15.0-45-generic
System Name : Linux
Version : #48~16.04.1-Ubuntu SMP Tue Jan 29 18:03:48 UTC 2019
OS specific info : debianstretch/sid
glibc info : glibc 2.10
__CUDA Information__
Found 1 CUDA devices
id 0 b'GeForce GTX 1080' [SUPPORTED]
compute capability: 6.1
pci device id: 0
pci bus id: 1
Summary:
1/1 devices are supported
CUDA driver version : 10000
__Conda Information__
conda_build_version : 3.17.6
conda_env_version : 4.6.3
platform : linux-64
python_version : 3.7.1.final.0
EDIT: When running the example with batch_size=2 the error also occurs.
While trying to fix the error I got another error.
ValueError: Currently only batch size 1 is supported.
Waiting seems to be the solution.
The current Faster R-CNN implementation does not support training with a batch size larger than 1, but you can rewrite it to support it, as in the code below.
https://github.com/knorth55/chainer-light-head-rcnn/blob/master/light_head_rcnn/links/model/light_head_rcnn_train_chain.py
Another option is to use Faster R-CNN with FPN in ChainerCV.
The latest version of ChainerCV has a Faster R-CNN with FPN that supports training with a batch size larger than 1.
https://github.com/chainer/chainercv/blob/master/examples/fpn/train_multi.py
self.converter assumes that the first argument of batch is composed of inputs that have the same shape. For example, if you use image dataset, all images are supposed to have the shape of (C, H, W).
So, could you check that your dataset returns images that all have the same shape?
And if your dataset has various shapes of images, how about using TransformDataset like https://github.com/chainer/chainercv/blob/df63b74ef20f9d8c830e266881e577dd05c17442/examples/faster_rcnn/train.py#L86?
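To illustrate the shape requirement, here is a small NumPy-only sketch. It mimics the xp.concatenate([array[None] for array in arrays]) line from the traceback on two images of different height, and then stacks them after zero-padding to a common size. The helper pad_to_common_shape is hypothetical and is not part of Chainer or ChainerCV; in a real pipeline the resizing or padding would normally happen in the transform passed to TransformDataset.

```python
import numpy as np


def pad_to_common_shape(arrays, fill=0):
    """Zero-pad (C, H, W) arrays at the bottom/right to the largest H and W."""
    max_h = max(a.shape[1] for a in arrays)
    max_w = max(a.shape[2] for a in arrays)
    return [
        np.pad(a, ((0, 0), (0, max_h - a.shape[1]), (0, max_w - a.shape[2])),
               mode="constant", constant_values=fill)
        for a in arrays
    ]


# Two images with the same channels and width but different heights.
images = [np.ones((3, 4, 5), dtype=np.float32),
          np.ones((3, 6, 5), dtype=np.float32)]

try:
    np.concatenate([img[None] for img in images])
except ValueError as err:
    print("stacking failed:", err)

batch = np.concatenate([img[None] for img in pad_to_common_shape(images)])
print(batch.shape)
```

Note that the same rule applies to every element of an example, so bounding-box or label arrays with a different number of objects per image would fail to stack in the same way.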
I wanted to train on multiple CPUs, so I ran this command:
C:\Users\solution\Desktop\Tensorflow\research>python object_detection/train.py --logtostderr --pipeline_config_path=C:\Users\solution\Desktop\Tensorflow\myFolder\power_drink.config --train_dir=C:\Users\solution\Desktop\Tensorflow\research\object_detection\train --num_clones=2 --clone_on_cpu=True
and I got the following error:
Traceback (most recent call last):
File "object_detection/train.py", line 169, in
tf.app.run()
File "C:\Users\solution\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\platform\app.py", line 124, in run
_sys.exit(main(argv))
File "object_detection/train.py", line 165, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "C:\Users\solution\Desktop\Tensorflow\research\object_detection\trainer.py", line 246, in train
clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
File "C:\Users\solution\Desktop\Tensorflow\research\slim\deployment\model_deploy.py", line 193, in create_clones
outputs = model_fn(*args, **kwargs)
File "C:\Users\solution\Desktop\Tensorflow\research\object_detection\trainer.py", line 158, in _create_losses
train_config.merge_multiple_label_boxes)
ValueError: not enough values to unpack (expected 7, got 0)
If I set num_clones to 1 or omit it, training works normally.
I also tried setting --ps_tasks=1, which doesn't help.
Any advice would be appreciated.
I solved this issue by changing one parameter in my original configuration slightly:
...
train_config: {
  fine_tune_checkpoint: "C:/some_path/model.ckpt"
  batch_size: 1
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 8
  num_steps: 25000
  ...
}
...
Either changing the parameter to replicas_to_aggregate: 1 or setting sync_replicas: false solves the problem for me, since I was training on only one graphics card and did not have any replicas (as you would have when training on TPU).
You don't mention which type of model you are training. If, like me, you were using the default model from the TensorFlow Object Detection API example (Faster-RCNN-Inception-V2), then num_clones should equal the batch_size. I was using a GPU rather than CPUs, but when I went from one clone to two I saw a similar error, and setting batch_size: 2 in the training config file was the solution.
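A possible explanation for the "got 0" part of the message, as far as I can tell, is that the trainer splits the configured batch across the clones with an integer division, so a batch_size of 1 with two clones leaves each clone with an empty batch. The sketch below only reproduces that arithmetic; per_clone_batch_size is a hypothetical helper, and the integer division is my assumption about how trainer.py behaves.

```python
def per_clone_batch_size(batch_size, num_clones):
    """Batch size each clone would receive under integer division (assumed)."""
    return batch_size // num_clones


for batch_size, num_clones in [(1, 1), (1, 2), (2, 2)]:
    share = per_clone_batch_size(batch_size, num_clones)
    print(f"batch_size={batch_size} num_clones={num_clones} -> per clone: {share}")
```

With batch_size: 1 and --num_clones=2 the share per clone is 0, which would be consistent with the unpacking error above; making batch_size at least as large as num_clones (and ideally a multiple of it) avoids that.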