MultiprocessIterator throws error when changing batch_size - python

I want to train a Faster R-CNN with ChainerCV. As a first test I mostly copied the provided example, only changing the lines that point to the dataset so that it uses my custom dataset. I checked that my dataset is fully functional with all the operations described in this tutorial.
If I run the script without changes everything works perfectly, but if I change the batch_size I get an error. I tried increasing shared_mem from 100 MB to 1000 MB, but the error didn't disappear.
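Roughly, the only change versus the example is the iterator construction, something like this sketch (train_data stands for my custom dataset; the exact variable names in the example differ):

from chainer.iterators import MultiprocessIterator

# train_data is my custom detection dataset (placeholder name).
# The only edit versus the example is batch_size=2; shared_mem is the
# value I later tried raising from 100 MB to 1000 MB.
train_iter = MultiprocessIterator(train_data, batch_size=2, shared_mem=100 * 1000 * 1000)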
Error when setting batch_size=2:
Exception in main training loop: all the input array dimensions except for the concatenation axis must match exactly
Traceback (most recent call last):
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/chainer/training/trainer.py", line 315, in run
update()
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
self.update_core()
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/chainer/training/updaters/standard_updater.py", line 171, in update_core
in_arrays = self.converter(batch, self.device)
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/chainer/dataset/convert.py", line 134, in concat_examples
[example[i] for example in batch], padding[i])))
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/chainer/dataset/convert.py", line 164, in _concat_arrays
return xp.concatenate([array[None] for array in arrays])
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "/home/cv/ChainerCV/faster_rcnn/train.py", line 131, in <module>
main()
File "/home/cv/ChainerCV/faster_rcnn/train.py", line 126, in main
trainer.run()
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/chainer/training/trainer.py", line 329, in run
six.reraise(*sys.exc_info())
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/chainer/training/trainer.py", line 315, in run
update()
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
self.update_core()
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/chainer/training/updaters/standard_updater.py", line 171, in update_core
in_arrays = self.converter(batch, self.device)
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/chainer/dataset/convert.py", line 134, in concat_examples
[example[i] for example in batch], padding[i])))
File "/home/cv/anaconda3/envs/chainer/lib/python3.6/site-packages/chainer/dataset/convert.py", line 164, in _concat_arrays
return xp.concatenate([array[None] for array in arrays])
ValueError: all the input array dimensions except for the concatenation axis must match exactly
System info:
__Hardware Information__
Machine : x86_64
CPU Name : skylake
Number of accessible CPU cores : 8
__OS Information__
Platform : Linux-4.15.0-45-generic-x86_64-with-debian-stretch-sid
Release : 4.15.0-45-generic
System Name : Linux
Version : #48~16.04.1-Ubuntu SMP Tue Jan 29 18:03:48 UTC 2019
OS specific info : debianstretch/sid
glibc info : glibc 2.10
__CUDA Information__
Found 1 CUDA devices
id 0 b'GeForce GTX 1080' [SUPPORTED]
compute capability: 6.1
pci device id: 0
pci bus id: 1
Summary:
1/1 devices are supported
CUDA driver version : 10000
__Conda Information__
conda_build_version : 3.17.6
conda_env_version : 4.6.3
platform : linux-64
python_version : 3.7.1.final.0
EDIT: The error also occurs when running the unmodified example with batch_size=2.

While trying to fix the error I ran into another one:
ValueError: Currently only batch size 1 is supported.
So for now, waiting for multi-batch support seems to be the solution.

The current Faster R-CNN implementation does not support multi-batch training, but you can rewrite it to support it, as in the code below:
https://github.com/knorth55/chainer-light-head-rcnn/blob/master/light_head_rcnn/links/model/light_head_rcnn_train_chain.py
Another option is to use Faster R-CNN with FPN in ChainerCV.
The latest version of ChainerCV includes Faster R-CNN with FPN, which supports multi-batch training.
https://github.com/chainer/chainercv/blob/master/examples/fpn/train_multi.py

self.converter assumes that the examples in a batch are made up of inputs that all have the same shape. For example, if you use an image dataset, all images are supposed to have the shape (C, H, W).
So, could you check that your dataset returns images of the same shape?
If your dataset contains images of various shapes, how about using TransformDataset as in https://github.com/chainer/chainercv/blob/df63b74ef20f9d8c830e266881e577dd05c17442/examples/faster_rcnn/train.py#L86?
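For example, a minimal sketch of forcing a fixed input size with TransformDataset (assuming each example is (img, bbox, label) as in ChainerCV's detection datasets; the 600x800 output size is just a placeholder):

from chainer.datasets import TransformDataset
from chainercv import transforms

def fixed_size_transform(in_data):
    # img is (C, H, W), bbox is (R, 4), label is (R,)
    img, bbox, label = in_data
    _, H, W = img.shape
    out_size = (600, 800)  # placeholder fixed (height, width)
    img = transforms.resize(img, out_size)
    bbox = transforms.resize_bbox(bbox, (H, W), out_size)
    return img, bbox, label

train_data = TransformDataset(train_data, fixed_size_transform)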

Related

Unexpected segmentation fault encountered in worker when using Pytorch Geometric to load dataset

I encounter the following error when using DataLoader workers to load data.
I am using NeighborSampler from PyG as the "loader" at run_main.py line 152 to load a custom dataset, with num_workers set to os.cpu_count().
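Roughly, the loader is constructed like this (data and the sampling sizes are placeholders, not my exact code; NeighborSampler forwards batch_size and num_workers to a regular torch DataLoader):

import os
from torch_geometric.loader import NeighborSampler

# data.edge_index comes from my custom graph dataset (placeholder name).
loader = NeighborSampler(
    data.edge_index,
    sizes=[15, 10],           # neighbours sampled per layer (example values)
    batch_size=1024,          # example value
    shuffle=True,
    num_workers=os.cpu_count())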
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
File "/home/user/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1134, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/usr/lib/python3.8/multiprocessing/queues.py", line 107, in get
if not self._poll(timeout):
File "/usr/lib/python3.8/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
r = wait([self], timeout)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/usr/lib/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
File "/home/user/.local/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1096707) is killed by signal: Segmentation fault.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "run_main.py", line 152, in train
for step, _ in enumerate(loader):
File "/home/user/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 652, in __next__
data = self._next_data()
File "/home/user/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1330, in _next_data
idx, data = self._get_data()
File "/home/user/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1296, in _get_data
success, data = self._try_get_data()
File "/home/user/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1147, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 1096707) exited unexpectedly
I am using PyTorch 1.12.0+cu116, one NVIDIA TITAN Xp GPU, CUDA 11.6, and Python 3.8.10.
I've searched a lot for this error and found the following suggested solutions. However, none of them helped.
Using num_workers of 0 or 1. When I run with num_workers of 0, it results in a "corrupted double-linked list" error. When I set num_workers to 1, the same error (Unexpected segmentation fault encountered in worker) still occurs. I don't really want to reduce num_workers anyway, because I am working on a pretty large dataset and fewer workers make loading much slower.
Increasing the shared memory size. I've done this by adding the line none /dev/shm tmpfs defaults,size=MY_SIZEG 0 0 to /etc/fstab and running mount -o remount /dev/shm. I've set MY_SIZE to exactly the size of main memory (it was previously 50% of main memory).
Changing the Python version to <= 3.6.9. I've tried this, but the same error still occurs.
Checking that Python and the dataset are mounted on the same disk. I've already verified that they are.
I've been struggling with this issue for several days and can't find the right solution, which is really frustrating. Could you please help me out?

Problem with executing python machine learning code I found on GitHub

I need some clear instructions on how to execute some code.
Context:
This is a python machine learning peptide-binding script, but you don't need to know biology to help me.
I am trying to recreate this scientific paper to test its validity and see whether I can use it. I work in the biotech industry and am only somewhat familiar with C# and python.
The paper links to a GitHub page, and the GitHub page has some instructions on how to execute the code. But every time I try to execute the code as instructed, it gives me an error. I already installed its requirements (the most recent pytorch, numpy and scikit-learn); I also switched between GPU and CPU, but neither worked. I don't know what to do at this point.
Paper Title:
"Prediction of Specific TCR-Peptide Binding From Large Dictionaries of TCR-Peptide Pairs" by Ido Springer, Hanan Besser. etc.
Paper's Github8 (found in the paper's abstract):
https://github.com/louzounlab/ERGO
These are the example commands I entered in the terminal; they were found in a comment at the end of ERGO.py.
GPU ver:
python ERGO.py train lstm mcpas specific cuda:0 --model_file=model.pt --train_data_file=train_data --test_data_file=test_data
GPU code results:
Traceback (most recent call last):
File "D:\D Download\ERGO-master\ERGO.py", line 437, in <module>
main(args)
File "D:\D Download\ERGO-master\ERGO.py", line 141, in main
model, best_auc, best_roc = lstm.train_model(train_batches, test_batches, args.device, arg, params)
File "D:\D Download\ERGO-master\lstm_utils.py", line 163, in train_model
model.to(device)
File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 927, in to
return self._apply(convert)
File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 579, in _apply
module._apply(fn)
File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 602, in _apply
param_applied = fn(param)
File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 925, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\cuda\__init__.py", line 211, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
CPU ver (only replaced "specific cuda:0" with "specific cpu"):
python ERGO.py train lstm mcpas specific cpu --model_file=model.pt --train_data_file=train_data --test_data_file=test_data
CPU code results:
epoch: 1
C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\functional.py:1960: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
Traceback (most recent call last):
File "D:\D Download\ERGO-master\ERGO.py", line 437, in <module>
main(args)
File "D:\D Download\ERGO-master\ERGO.py", line 141, in main
model, best_auc, best_roc = lstm.train_model(train_batches, test_batches, args.device, arg, params)
File "D:\D Download\ERGO-master\lstm_utils.py", line 173, in train_model
loss = train_epoch(batches, model, loss_function, optimizer, device)
File "D:\D Download\ERGO-master\lstm_utils.py", line 137, in train_epoch
loss = loss_function(probs, batch_signs)
File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\loss.py", line 613, in forward
return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
File "C:\Users\username\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\functional.py", line 3074, in binary_cross_entropy
raise ValueError(
ValueError: Using a target size (torch.Size([50])) that is different to the input size (torch.Size([50, 1])) is deprecated. Please ensure they have the same size.
Looking at the ValueError, it seems that what you're trying to do is deprecated in pytorch, so you have a more recent version of the package than the one the project was developed with. I suggest you try
pip install torch==1.4.0
in the command line.
I'm not familiar with pytorch, but managing tensor shapes in tensorflow is the biggest pain for me. What actually looks like the problem here is that the input has one more dimension than it should, so you would have to reshape it manually.
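For example, a sketch of that reshape at the point where the loss is computed (lstm_utils.py line 137 in your traceback; probs and batch_signs are the names from the traceback, I haven't run the repository myself):

# probs has shape (50, 1) and batch_signs has shape (50,); flatten the
# model output so both sides of binary_cross_entropy match.
loss = loss_function(probs.squeeze(-1), batch_signs)
# or, equivalently, add the extra dimension to the targets instead:
# loss = loss_function(probs, batch_signs.unsqueeze(-1))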

Numpy error in file "mtrand.pyx" while fitting a keras model

I am using:
keras-rl2 : 1.0.4
tensorflow : 2.4.1
numpy : 1.19.5
gym : 0.18.0
for training a DQN model in a reinforcement learning project.
My action space contains 60 discrete values:
self.action_space = Discrete(60)
and I am getting this error after x steps:
1901/10000 [====>.........................] - ETA: 1:02 - reward: 6.1348
Traceback (most recent call last):
File "D:/GitHub/networth-simulation/rebalancing_simple_discrete.py", line 203, in <module>
dqn.fit(env, nb_steps=5000, visualize=False, verbose=1)
File "D:\GitHub\networth-simulation\venv\lib\site-packages\rl\core.py", line 169, in fit
action = self.forward(observation)
File "D:\GitHub\networth-simulation\venv\lib\site-packages\rl\agents\dqn.py", line 227, in forward
action = self.policy.select_action(q_values=q_values)
File "D:\GitHub\networth-simulation\venv\lib\site-packages\rl\policy.py", line 227, in select_action
action = np.random.choice(range(nb_actions), p=probs)
File "mtrand.pyx", line 928, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN
When I use a smaller number of discrete actions (<10), it doesn't happen every time.
I found this fix, but I don't understand how to apply it: I can't find any file "numpy/random/mtrand/mtrand.pyx".
Has anyone found a way to fix that error?
This error apparently pops up when the values of the environment class attributes become too large (positively or negatively).
I solved the problem by limiting them in my step() method, as sketched below.
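A minimal sketch of what I mean by limiting values (the attribute names, bounds and reward are placeholders for my environment, not keras-rl code):

import numpy as np

def step(self, action):
    # ... update the environment state from the action (omitted) ...

    # Clip the attributes that feed the observation; if they grow without
    # bound, the Q-values overflow and the policy ends up calling
    # np.random.choice with NaN probabilities.
    self.net_worth = float(np.clip(self.net_worth, -1e6, 1e6))

    observation = np.array([self.net_worth], dtype=np.float32)
    reward = float(self.net_worth)                 # placeholder reward
    done = self.current_step >= self.max_steps     # placeholder termination
    return observation, reward, done, {}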

ML Engine: Prediction Error while executing local predict command

I have uploaded a version of the model to Google ML Engine with saved_model.pb and a variables folder. When I try to execute the command:
gcloud ml-engine local predict --model-dir=saved_model --json-instances=request.json
It shows the following error:
ERROR: (gcloud.ml-engine.local.predict) 2018-09-11 19:06:39.770396: I tensorflow/core/platform/cpu_feature_guard.cc:141]
Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Traceback (most recent call last):
File "lib/googlecloudsdk/command_lib/ml_engine/local_predict.py", line 172, in <module>
main()
File "lib/googlecloudsdk/command_lib/ml_engine/local_predict.py", line 167, in main
signature_name=args.signature_name)
File "/usr/lib/google-cloud-sdk/lib/third_party/ml_sdk/cloud/ml/prediction/prediction_lib.py", line 106, in local_predict
predictions = model.predict(instances, signature_name=signature_name)
File "/usr/lib/google-cloud-sdk/lib/third_party/ml_sdk/cloud/ml/prediction/prediction_utils.py", line 230, in predict
preprocessed = self.preprocess(instances, stats=stats, **kwargs)
File "/usr/lib/google-cloud-sdk/lib/third_party/ml_sdk/cloud/ml/prediction/frameworks/tf_prediction_lib.py", line 436, in preprocess
preprocessed = self._canonicalize_input(instances, signature)
File "/usr/lib/google-cloud-sdk/lib/third_party/ml_sdk/cloud/ml/prediction/frameworks/tf_prediction_lib.py", line 453, in _canonicalize_input
return canonicalize_single_tensor_input(instances, tensor_name)
File "/usr/lib/google-cloud-sdk/lib/third_party/ml_sdk/cloud/ml/prediction/frameworks/tf_prediction_lib.py", line 166, in canonicalize_single_tensor_input
instances = [parse_single_tensor(x, tensor_name) for x in instances]
File "/usr/lib/google-cloud-sdk/lib/third_party/ml_sdk/cloud/ml/prediction/frameworks/tf_prediction_lib.py", line 162, in parse_single_tensor
(tensor_name, list(x.keys())))
cloud.ml.prediction.prediction_utils.PredictionError: Invalid inputs: Expected tensor name: inputs, got tensor name: [u'inputs', u'key']. (Error code: 1)
My request.json file is
{"inputs": {"b64": "/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAHVArwDASIAAhEBAxEB/8QAHwAAAQUBAQEBA....."}, "key": "841bananas.jpg"}
Thanks in advance.
It appears your model was exported with only one input, named "inputs". In that case, you shouldn't be sending "key" in the JSON, i.e. (scroll to the end to see that I've removed "key"):
{"inputs": {"b64": "/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAHVArwDASIAAhEBAxEB/8QAHwAAAQUBAQEBA....."}}

[Tensorflow][Object detection] ValueError when trying to train with --num_clones=2

I wanted to train on multiple CPUs, so I ran this command
C:\Users\solution\Desktop\Tensorflow\research>python
object_detection/train.py --logtostderr
--pipeline_config_path=C:\Users\solution\Desktop\Tensorflow\myFolder\power_drink.config --train_dir=C:\Users\solution\Desktop\Tensorflow\research\object_detection\train
--num_clones=2 --clone_on_cpu=True
and I got the following error
Traceback (most recent call last): File "object_detection/train.py",
line 169, in
tf.app.run() File "C:\Users\solution\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\platform\app.py",
line 124, in run
_sys.exit(main(argv)) File "object_detection/train.py", line 165, in main
worker_job_name, is_chief, FLAGS.train_dir) File "C:\Users\solution\Desktop\Tensorflow\research\object_detection\trainer.py",
line 246, in train
clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue]) File
"C:\Users\solution\Desktop\Tensorflow\research\slim\deployment\model_deploy.py",
line 193, in create_clones
outputs = model_fn(*args, **kwargs) File "C:\Users\solution\Desktop\Tensorflow\research\object_detection\trainer.py",
line 158, in _create_losses
train_config.merge_multiple_label_boxes) ValueError: not enough values to unpack (expected 7, got 0)
If I set num_clones to 1 or omit it, it works normally.
I also tried setting --ps_tasks=1, which didn't help.
Any advice would be appreciated.
I solved this issue by changing one parameter in my original configuration slightly:
...
train_config: {
fine_tune_checkpoint: "C:/some_path/model.ckpt"
batch_size: 1
sync_replicas: true
startup_delay_steps: 0
replicas_to_aggregate: 8
num_steps: 25000
...
}
...
Changing the parameter to replicas_to_aggregate: 1, or setting sync_replicas: false, both solve the problem for me, since I was training on only one graphics card and did not have any replicas (as you would have when training on a TPU).
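For clarity, the train_config I ended up with looked roughly like this (either changed line on its own was enough in my case):
...
train_config: {
fine_tune_checkpoint: "C:/some_path/model.ckpt"
batch_size: 1
sync_replicas: false
startup_delay_steps: 0
replicas_to_aggregate: 1
num_steps: 25000
...
}
...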
You don't mention which type of model you are training - if, like me, you were using the default model from the TensorFlow Object Detection API example (Faster-RCNN-Inception-V2), then num_clones should equal the batch_size. I was using a GPU, however; when I went from one clone to two, I saw a similar error, and setting batch_size: 2 in the training config file was the solution.
