Getting an error while training Resnet50 on Imagenet at 14th Epoch - python

I am training Resnet50 on imagenet using the script provided from PyTorch (with a slight trivial tweak for my purpose). However, I am getting the following error after 14 epochs of training. I have allocated 4 gpus in the server I'm using to run this. Any pointers as to what this error is about would be appreciated. Thanks a lot!
Epoch: [14][5000/5005] Time 1.910 (2.018) Data 0.000 (0.191) Loss 2.6954 (2.7783) Total 2.6954 (2.7783) Reg 0.0000 Prec#1 42.969 (40.556) Prec#5 64.844 (65.368)
Test: [0/196] Time 86.722 (86.722) Loss 1.9551 (1.9551) Prec#1 51.562 (51.562) Prec#5 81.641 (81.641)
Traceback (most recent call last):
File "main_group.py", line 549, in <module>
File "main_group.py", line 256, in main
File "main_group.py", line 466, in validate
if args.gpu is not None:
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 801, in __next__
return self._process_data(data)
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
data.reraise()
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/_utils.py", line 385, in reraise
raise self.exc_type(msg)
OSError: Caught OSError in DataLoader worker process 11.
Original Traceback (most recent call last):
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 138, in __getitem__
sample = self.loader(path)
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 174, in default_loader
return pil_loader(path)
File "/home/users/oiler/anaconda3/envs/ml/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 155, in pil_loader
with open(path, 'rb') as f:
OSError: [Errno 5] Input/output error: '/data/users2/oiler/github/imagenet-data/val/n02102973/ILSVRC2012_val_00009130.JPEG'

It is difficult to tell what the problem is just by looking at the error you have posted.
All we know is that there was an issue reading the file at '/data/users2/oiler/github/imagenet-data/val/n02102973/ILSVRC2012_val_00009130.JPEG'.
Try the following:
Confirm the file actually exists.
Confirm that it is infact a valid JPEG and not corrupted (by viewing it).
Confirm that you can open it with Python and also load it manually with PIL.
If none of that works, try deleting the file. Do you get the same error on another file in the folder?

Related

pyspark unable load pipelineModel

I met a problem, unable to load PipelineModel
I test my model in practice environment, but unable to apply this model and code on production environment
Traceback (most recent call last):
File "/home/fwfx_yaofei/telbd-yjy/src/ml/complain_user_it/predict/model_predict.py", line 228, in <module>
main(xdr_input_file,model_file,xdr_output_file)
File "/home/fwfx_yaofei/telbd-yjy/src/ml/complain_user_it/predict/model_predict.py", line 215, in main
xdr_df_predict = xdr_predict(xdr_df,model_file)
File "/home/fwfx_yaofei/telbd-yjy/src/ml/complain_user_it/predict/model_predict.py", line 193, in xdr_predict
loadmodel = PipelineModel.load(model_input_path)
File "/usr/bch/1.5.0/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 257, in load
File "/usr/bch/1.5.0/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 197, in load
File "/usr/bch/1.5.0/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/bch/1.5.0/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 79, in deco
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.PipelineModel but found class name pyspark.ml.pipeline.PipelineModel'
21/12/01 12:01:06 INFO SparkContext: Invoking stop() from shutdown hook
Thanks all the help.I am a intern in bigdata industry.This is my first time to post in stackoverflow,i am sorry to post in unreguired method.
Finally,i sovled this problem to adjust my code from spark2.4 to spark2.2.
Here is details about this tracesback:
I test my code in test environment under the version of spark2.4 and python3.7; i meet error when i deploy it in product environment under the version of spark2.2 and python3.7.
I train model under product envirenment, this si Model generate error:
Traceback (most recent call last):
File "/home/fwfx_yaofei/telbd-yjy/src/ml/complain_user_it/train/model_generate.py", line 331, in
main(xdr_file_path,jingfeng_file_path,save_model_path)
File "/home/fwfx_yaofei/telbd-yjy/src/ml/complain_user_it/train/model_generate.py", line 318, in main
tvs_piplineModel, gbdt_bestModel = generate_model(label_col, xdr_75109_String_title, union_df, save_model_path)
File "/home/fwfx_yaofei/telbd-yjy/src/ml/complain_user_it/train/model_generate.py", line 310, in generate_model
tvs_piplineModel.save(save_model_path)
File "/usr/bch/1.5.0/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 217, in save
File "/usr/bch/1.5.0/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 212, in write
File "/usr/bch/1.5.0/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 100, in init
File "/usr/bch/1.5.0/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 249, in _to_java
AttributeError: 'TrainValidationSplitModel' object has no attribute '_to_java'
when i skip model generate to model predict in model which i trained in test envirenment,this is Model predict error:
Traceback (most recent call last):
File "/home/fwfx_yaofei/telbd-yjy/src/ml/complain_user_it/predict/model_predict.py", line 228, in
main(xdr_input_file,model_file,xdr_output_file)
File "/home/fwfx_yaofei/telbd-yjy/src/ml/complain_user_it/predict/model_predict.py", line 215, in main
xdr_df_predict = xdr_predict(xdr_df,model_file)
File "/home/fwfx_yaofei/telbd-yjy/src/ml/complain_user_it/predict/model_predict.py", line 193, in xdr_predict
loadmodel = PipelineModel.load(model_input_path)
File "/usr/bch/1.5.0/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 257, in load
File "/usr/bch/1.5.0/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 197, in load
File "/usr/bch/1.5.0/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call
File "/usr/bch/1.5.0/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 79, in deco
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.PipelineModel but found class name pyspark.ml.pipeline.PipelineModel'
I check official document it explains ML persistence:version changes of ML persistence
So the error might casused by the vision of spark.
I ignore the "TrainValidationSplitModel" function which mention in first traceback,and it does work.My code run successfully.
Success run screenshot
Conclusion, my code aim to deploy a machine-learning classificaiton model in product environment. So i import pyspark.ml.gbdt to process dataframe. But i ignore the vision of test and product envirenoment.Thanks for all the help,this is the experience of an chinese intern.Forgive my pool expressive ability.

NonMatchingSplitsSizesError loading huggingface BookCorpus

I want to load bookcorpus like this:
train_ds, test_ds = load_dataset('bookcorpus', split=['train', 'test']),
however, get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/marcelbraasch/.local/lib/python3.8/site-packages/datasets/load.py", line 1627, in load_dataset
builder_instance.download_and_prepare(
File "/home/marcelbraasch/.local/lib/python3.8/site-packages/datasets/builder.py", line 607, in download_and_prepare
self._download_and_prepare(
File "/home/marcelbraasch/.local/lib/python3.8/site-packages/datasets/builder.py", line 709, in _download_and_prepare
verify_splits(self.info.splits, split_dict)
File "/home/marcelbraasch/.local/lib/python3.8/site-packages/datasets/utils/info_utils.py", line 74, in verify_splits
raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=4853859824, num_examples=74004228, dataset_name='bookcorpus'), 'recorded': SplitInfo(name='train', num_bytes=2982081448, num_examples=45726619, dataset_name='bookcorpus')}]
I want to proceed to save this to disk as I don't want to download this every time I use it. What causes this error?
BookCorpus is no longer publicly available.
Here is a work around:
https://github.com/soskek/bookcorpus

TypeError: __array__() takes 1 positional argument but 2 were given

I've been doing the pytorch tutorial (https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html) and have been getting this error that I don't know how to fix. The full error is below:
Traceback (most recent call last):
File "main.py", line 146, in <module>
main()
File "main.py", line 138, in main
train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
File "/engine.py", line 26, in train_one_epoch
for images, targets in metric_logger.log_every(data_loader, print_freq, header):
File "/utils.py", line 180, in log_every
for obj in iterable:
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 425, in reraise
raise self.exc_type(msg)
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataset.py", line 311, in __getitem__
return self.dataset[self.indices[idx]]
File "main.py", line 64, in __getitem__
img, target = self.transforms(img, target)
File "/transforms.py", line 26, in __call__
image, target = t(image, target)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/transforms.py", line 50, in forward
image = F.to_tensor(image)
File "/usr/local/lib/python3.6/dist-packages/torchvision/transforms/functional.py", line 129, in to_tensor
np.array(pic, mode_to_nptype.get(pic.mode, np.uint8), copy=True)
TypeError: __array__() takes 1 positional argument but 2 were given
I believe it means somewhere I'm using an array with 2 arguments which isn't allowed, but I don't really know where abouts that is happening - perhaps in one of their pre written libraries?
I can share the code in full if desired, but thought its a bit unwieldy. Does anyone know what might be causing this error?
PyTorch has already considered this issue. It does not seem to be a PyTorch problem.
As xwang233 mentioned in the issue, we can fix it by downgrading pillow:
pip install pillow==8.2.0
This issue could be fixed as well by upgrading Pillow from version 8.3.0 to 8.3.1. I had the same issue with
torch==1.9.0+cu111
torchvision==0.10.0+cu111
Pillow==8.3.0
After Pillow was upgraded to 8.3.1 (with no change to torch and torchvision) as below, the issue is gone:
pip install --upgrade pillow
Thanks to DRTorresRuiz for providing the clue about Pillow.
I had the same error when using:
torch==1.9.0
torchvision==0.10.0
In my requirements.txt file I downgraded the torch library, which forced me to downgrade torchvision, and that fixed the error for me. The library versions I ended up using that did not raise the error were:
torch==1.8.1
torchvision==0.9.1
change your code:
np.array(pic ,np.float32)
to:
np.array(pic).astype('float32')

Error while predicting value from saved model

My friend saved a decision tree model using joblib.dump(). But when I tried to predict using some data from the saved model, I got the following error. Can anyone advise as to why this is happening? My friend and I have the same versions of all required libraries
Traceback (most recent call last):
File "testing.py", line 12, in <module>
classifier = joblib.load('saved_model.pkl')
File "C:\Users\naiks\AppData\Local\Programs\Python\Python38-32\lib\site-packages\joblib\numpy_pickle.py", line 585, in load
obj = _unpickle(fobj, filename, mmap_mode)
File "C:\Users\naiks\AppData\Local\Programs\Python\Python38-32\lib\site-packages\joblib\numpy_pickle.py", line 504, in _unpickle
obj = unpickler.load()
File "C:\Users\naiks\AppData\Local\Programs\Python\Python38-32\lib\pickle.py", line 1210, in load
dispatch[key[0]](self)
File "C:\Users\naiks\AppData\Local\Programs\Python\Python38-32\lib\pickle.py", line 1587, in load_reduce
stack[-1] = func(*args)
File "sklearn\tree\_tree.pyx", line 607, in sklearn.tree._tree.Tree.__cinit__
ValueError: Buffer dtype mismatch, expected 'SIZE_t' but got 'long long'
I was using a 32 bit version of python, whereas the machine on which the model was trained was using 64 bit python. Hence the buffer size was different, resulting in the error

tensorflow.python.framework.errors_impl.UnknownError: Failed to rename: Input/output error

When I apply tensorflow eager mode to train a classifier, I met the following error.
Steps 151, Train loss is 0.00039766659028828144, learning_rate is 0.009999999776482582
Traceback (most recent call last):
File "E:/Tensorflow_Experiments/train_alexnet.py", line 1354, in <module>
The error occurs after several iterations, as shown above. But the iteration number is different for successional executions.
File "C:\Software\Anaconda3\lib\site-packages\tensorflow\contrib\eager\python\saver.py", line 156, in save
None, file_prefix, write_meta_graph=False, global_step=global_step)
File "C:\Software\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 1451, in save
save_relative_paths=self._save_relative_paths)
File "C:\Software\Anaconda3\lib\site-packages\tensorflow\python\training\checkpoint_management.py", line 237, in update_checkpoint_state_internal
text_format.MessageToString(ckpt))
File "C:\Software\Anaconda3\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 436, in atomic_write_string_to_file
rename(temp_pathname, filename, overwrite)
File "C:\Software\Anaconda3\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 415, in rename
compat.as_bytes(oldname), compat.as_bytes(newname), overwrite, status)
File "C:\Software\Anaconda3\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: Failed to rename: ./hundred_models\model1\checkpoint.tmpc4b15b8c1e2d48b394f810909a0838b6 to: ./hundred_models\model1\checkpoint : \udcbeܾ\udcf8\udcb7\udcc3\udcceʡ\udca3
; Input/output error
This answer is way to late, but here is what fixed it for me.
Check whether you have a similarly named folder or in my code a csv logger interfered with the creation of the checkpoint.

Categories

Resources