I'm trying to load hdf5 datasets into a pytorch training for loop.
Regardless of num_workers in dataloader, this randomly throws "KeyError: 'Unable to open object (component not found)' " (traceback below).
I'm able to start the training loop, but I can't get through a quarter of one epoch before this error occurs for random 'datasets' (each a 2D array). I can load these same arrays separately in the console with the usual f['group/subgroup'][()], so it doesn't appear that the hdf file is corrupted or that there's anything wrong with the datasets/arrays.
I've tried:
adjusting num_workers, as per various other issues people have had with pytorch - it still happens with num_workers=0.
upgrading/downgrading torch, numpy and python versions.
calling f.close() at the end of the dataloader's __getitem__
using a fresh conda env and installing dependencies.
calling the parent group first, then initialising the array, e.g.:
X = f[ID] then X = X[()]
using double slashes in the hdf path
Because this recurs with num_workers=0, I figure it's not a multiprocessing issue, although the traceback seems to point to the lines in torch/utils/data/dataloader.py that prep the next batch.
I just can't figure out why h5py can't see the odd individual dataset, seemingly at random.
IDs are strings to match hdf paths eg:
ID = "ID_12345//Ep_-1//AN_67891011//ABC"
excerpt from dataloader:
def __getitem__(self, index):
    ID = self.list_IDs[index]
    # Start hdf file in read mode:
    f = h5py.File(self.hdf_file, 'r', libver='latest', swmr=True)
    X = f[ID][()]
    X = X[:, :, np.newaxis]  # torchvision 0.2.1 needs (H x W x C) for transforms
    y = self.y_list[index]
    if self.transform:
        X = self.transform(X)
    return ID, X, y
Expected: training for loop
Actual: IDs / datasets / examples are loaded fine initially, then after between 20 and 200 steps...
Traceback (most recent call last):
  File "Documents/BSSA-loc/mamdl/models/main_v3.py", line 287, in <module>
    main()
  File "Documents/BSSA-loc/mamdl/models/main_v3.py", line 203, in main
    for i, (IDs, images, labels) in enumerate(train_loader):
  File "/home/james/anaconda3/envs/jc/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 615, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/home/james/anaconda3/envs/jc/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 615, in <listcomp>
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/home/james/Documents/BSSA-loc/mamdl/src/data_loading/Data_loader_v3.py", line 59, in __getitem__
    X = f[ID][()]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/james/anaconda3/envs/jc/lib/python3.7/site-packages/h5py/_hl/group.py", line 262, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: 'Unable to open object (component not found)'
For the record, my best guess is that this was due to a bug in my code for hdf construction, which was stopped and started multiple times in append mode.
Some datasets appeared complete when queried with f['group/subgroup'][()], but could not be loaded with the pytorch dataloader.
I haven't had this issue since rebuilding the hdf file differently.
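For what it's worth, the workaround I've most often seen suggested for h5py plus DataLoader workers is to open the file lazily, once per worker, instead of on every __getitem__ call. Below is only a minimal sketch of that pattern, assuming the same attributes as my loader above (hdf_file, list_IDs, y_list, transform); it is not what fixed it for me - rebuilding the hdf file was.

import h5py
import numpy as np
from torch.utils.data import Dataset

class LazyHDF5Dataset(Dataset):
    """Sketch only: the h5py.File handle is created on first access,
    so each DataLoader worker process ends up with its own handle."""

    def __init__(self, hdf_file, list_IDs, y_list, transform=None):
        self.hdf_file = hdf_file
        self.list_IDs = list_IDs
        self.y_list = y_list
        self.transform = transform
        self._f = None  # opened lazily in __getitem__

    def __len__(self):
        return len(self.list_IDs)

    def __getitem__(self, index):
        if self._f is None:
            self._f = h5py.File(self.hdf_file, 'r', libver='latest', swmr=True)
        ID = self.list_IDs[index]
        X = self._f[ID][()]
        X = X[:, :, np.newaxis]  # (H x W x C) for torchvision transforms
        if self.transform:
            X = self.transform(X)
        return ID, X, self.y_list[index]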
I'm attempting to create an adversarially debiased bert masked language model using 'AdversarialBiasMitigator' alongside the AllenNLP pretrained MLM (from here: https://storage.googleapis.com/allennlp-public-models/bert-masked-lm-2020-10-07.tar.gz).
The training data I am using is a variation of the WinoBias dataset, edited to work for masked language modelling. The data is a pandas df whose first column contains the sentences (which already contain the [CLS], [SEP], and [MASK] tokens) and whose second column contains the target (a gendered pronoun).
I have edited masked_language_model_reader.py to correctly read in my pandas df, and I have edited the adversarial_bias_mitigator config file. The remaining files (adversarial_bias_mitigator.py and masked_language_model.py) I have kept the same, so I think the source of the error must be either in the config or in the mlm dataset reader I have created.
The main changes I have made in the dataset reader are changing the tokenizer to PretrainedTransformerTokenizer, and editing the _read() method to the following:
#overrides
def _read(self, file_path: str):
    import pandas as pd
    data = pd.read_csv(file_path)
    targets = data.iloc[:, 0].tolist()
    sentences = data.iloc[:, 1].tolist()
    zipped = zip(sentences, targets)
    for t, s in zipped:
        sentence = s
        tokens = self._tokenizer.tokenize(sentence)
        target = str(t)
        t = Token("[MASK]")
        yield self.text_to_instance(sentence, tokens, [target])
The rest I have kept virtually the same as the original masked_language_model_reader.py (https://github.com/allenai/allennlp-models/blob/aed4876f04a73c7effddf41b3164e1fb6fb6c275/allennlp_models/lm/masked_language_model_reader.py). I know the above isn't very pythonic, but it's the simplest way I could think of, and my dataset isn't that large (only 1000 sentences), so I don't think computing time is a problem.
When running all the relevant files in the CLI, the below error appears:
2021-10-02 10:52:20,351 - INFO - allennlp.training.gradient_descent_trainer - Training
0it [00:00, ?it/s]
loading instances: 0it [00:00, ?it/s]
loading instances: 162it [00:00, 1616.98it/s]
loading instances: 324it [00:00, 1545.78it/s]
loading instances: 479it [00:00, 1524.23it/s]
loading instances: 681it [00:00, 1713.15it/s]
loading instances: 1049it [00:00, 1764.63it/s]
0it [00:00, ?it/s]
2021-10-02 10:52:20,959 - CRITICAL - root - Uncaught exception
Traceback (most recent call last):
  File "/usr/local/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/usr/local/lib/python3.7/dist-packages/allennlp/__main__.py", line 46, in run
    main(prog="allennlp")
  File "/usr/local/lib/python3.7/dist-packages/allennlp/commands/__init__.py", line 122, in main
    args.func(args)
  File "/usr/local/lib/python3.7/dist-packages/allennlp/commands/train.py", line 121, in train_model_from_args
    file_friendly_logging=args.file_friendly_logging,
  File "/usr/local/lib/python3.7/dist-packages/allennlp/commands/train.py", line 187, in train_model_from_file
    return_model=return_model,
  File "/usr/local/lib/python3.7/dist-packages/allennlp/commands/train.py", line 260, in train_model
    file_friendly_logging=file_friendly_logging,
  File "/usr/local/lib/python3.7/dist-packages/allennlp/commands/train.py", line 504, in _train_worker
    metrics = train_loop.run()
  File "/usr/local/lib/python3.7/dist-packages/allennlp/commands/train.py", line 577, in run
    return self.trainer.train()
  File "/usr/local/lib/python3.7/dist-packages/allennlp/training/gradient_descent_trainer.py", line 750, in train
    metrics, epoch = self._try_train()
  File "/usr/local/lib/python3.7/dist-packages/allennlp/training/gradient_descent_trainer.py", line 773, in _try_train
    train_metrics = self._train_epoch(epoch)
  File "/usr/local/lib/python3.7/dist-packages/allennlp/training/gradient_descent_trainer.py", line 490, in _train_epoch
    batch_outputs = self.batch_outputs(batch, for_training=True)
  File "/usr/local/lib/python3.7/dist-packages/allennlp/training/gradient_descent_trainer.py", line 383, in batch_outputs
    output_dict = self._pytorch_model(**batch)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1071, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/allennlp/fairness/adversarial_bias_mitigator.py", line 121, in forward
    predictor_output_dict = self.predictor.forward(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/allennlp_models/lm/models/masked_language_model.py", line 110, in forward
    embeddings = self._text_field_embedder(tokens)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1071, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 103, in forward
    token_vectors = embedder(**tensors, **forward_params_values)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1071, in _call_impl
    result = forward_call(*input, **kwargs)
TypeError: forward() got an unexpected keyword argument 'tokens'
I can't seem to work out what the problem is, or why being passed 'tokens' would be a problem. I'm wondering if it is the way I am reading in the data, and whether it is not being correctly formatted into an instance, but again, I can't see an obvious difference between my method and the original script.
To try to fix the problem I have also added into the config:
"token_indexers": {
"bert": {
"type": "single_id"
}
as well as:
"sorting_keys":["tokens"]
I'm not sure if either of these things is related, or whether they are helping or worsening the problem!
Thanks for any help.
TypeError: forward() got an unexpected keyword argument 'tokens'
The name of the field in your dataset reader should match the name of the field that the model is expecting. You can check the forward function of the model to get the correct field name.
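As an illustration only (not a tested fix; the model name and the "tokens" key below are assumptions), "matching" here means that the key and type under token_indexers in the reader mirror the token_embedders entry the model was configured with:

# Rough sketch: align the reader's token_indexers with the model's embedder.
# "bert-base-uncased" and the "tokens" key are illustrative assumptions.
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

model_name = "bert-base-uncased"
tokenizer = PretrainedTransformerTokenizer(model_name)
token_indexers = {
    # The key must match the key used in the model's text_field_embedder.
    # A "single_id" indexer under this key hands the embedder a `tokens=`
    # keyword argument that a pretrained-transformer embedder does not accept,
    # which is consistent with the TypeError above.
    "tokens": PretrainedTransformerIndexer(model_name),
}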
I am getting random failures to create the callback model weights file when running Keras (with TensorFlow) models in a loop with changing model parameters and input data. I am thinking it might be something to do with the length of my directory name, but if so, it seems like a bug, since it only happens sometimes. Prior to the following error it wrote multiple files with directories of the same length. I am using long directory names to make it easier to distinguish runs in TensorBoard.

I will show my basic code setup as pseudo code and then the random error that I am getting. I have a nested for loop that changes model parameters as well as input data. The loop will work fine for hours and then randomly fail with the same error at some point. I would like to know if I am doing something wrong in my file name that is causing this. I would also like a workaround so that when it fails, I can keep running and move on to the next file, skipping the one that is failing - some type of try/except, but I don't know enough about h5py to know how to code that (see the sketch after the pseudo code below).

I am running on Windows 10 (conda env), tensorflow-gpu 1.6.0, Keras 2.1.5, h5py 2.7.1, tensorboard 1.6.0. I also set up Windows 10 to handle long file names. This error seems to be coming directly from h5py (h5py\h5f.pyx). Also, the file actually does get created and written: I can load it using h5py.File(), and it is the correct size and has the same groups and objects.

Update: I included the os.makedirs() line in my code that I did not show before. I also added a check on the directory creation and ran the code again. It still failed in the same way, and the isdir() check was never triggered.

Update 2: I want to point out that I am using multiprocessing because of memory leaks when using Keras with TensorFlow; these happen regardless of K.clear_session() and tf.reset_default_graph(). I now believe this random error is related to multiprocessing, as I have not observed it since eliminating the pooling process.
def main():
    for input_data in input_data_list:
        for model_parameters in model_parameters_list:
            # run model with different parameters on all data
            pool = multiprocessing.Pool(1)
            pool.apply(run_function, run_parameters..., model_func_name,
                       model_func_dict)
            pool.close()

def run_function(run_parameters..., model_func_name, model_func_dict, ...):
    # code to extract x_train, y_train, x_val, y_val etc. not shown
    # model_def = long string representing model parameters, for example:
    # model_def = 'basic_ff_nn4_mse_dr50_Nadam_LeakyReLU_kr_l2_ar_off_ns_0_BCtoA_all_2_2'

    # build and compile model
    model = model_func_name(**model_func_dict)

    # set up callbacks
    os.makedirs(models_dir + "{}_{}_{}_{}/".format(model_def, set_name,
                                                   fold, set_num), exist_ok=True)
    tmp_path = models_dir + "{}_{}_{}_{}/".format(model_def, set_name, fold,
                                                  set_num)
    best_weights_file = models_dir + "{}_{}_{}_{}/best_weights.hdf5".format(
        model_def, set_name, fold, set_num)
    best_model_weights = callbacks.ModelCheckpoint(best_weights_file,
                                                   save_best_only=True,
                                                   save_weights_only=True)
    log_dir = 'output/{}_{}/tf_logs/{}/{}/{}'.format(model_type, cur_time,
                                                     model_def, set_name,
                                                     'f' + str(fold))
    tensorboard = callbacks.TensorBoard(log_dir=log_dir,
                                        histogram_freq=0, write_graph=False,
                                        write_images=False,
                                        write_grads=False)
    if not os.path.isdir(tmp_path):
        print('path not created = ', tmp_path)

    model_history = model.fit(x=x_train, y=y_train,
                              verbose=0,
                              batch_size=size_batches,
                              epochs=num_epochs,
                              validation_data=[x_val, y_val],
                              callbacks=[best_model_weights, tensorboard],
                              )
    K.clear_session()
    tf.reset_default_graph()
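Here is the kind of skip-and-continue workaround I am asking about, as a sketch only: it wraps the fit call so that a failed checkpoint write does not kill the whole loop. The variable names are the ones from the pseudo code above, and catching OSError/IOError is an assumption based on the traceback below.

# Sketch only: skip the current run if the checkpoint file can't be written.
try:
    model_history = model.fit(x=x_train, y=y_train,
                              verbose=0,
                              batch_size=size_batches,
                              epochs=num_epochs,
                              validation_data=[x_val, y_val],
                              callbacks=[best_model_weights, tensorboard])
except (OSError, IOError) as err:
    # h5py surfaces the failed file creation as an OSError/IOError
    print('Skipping {} - checkpoint write failed: {}'.format(model_def, err))
    model_history = None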
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\multiprocessing\pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "C:\Users\xxxxx\Dropbox (test)\Lab\VLL models\zakworkspace\cps\cps_main.py", line 1042, in run_joint_ff
callbacks=[best_model_weights, tensorboard],
File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\keras\models.py", line 963, in fit
validation_steps=validation_steps)
File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\keras\engine\training.py", line 1705, in fit
validation_steps=validation_steps)
File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\keras\engine\training.py", line 1255, in _fit_loop
callbacks.on_epoch_end(epoch, epoch_logs)
File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\keras\callbacks.py", line 77, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\keras\callbacks.py", line 445, in on_epoch_end
self.model.save_weights(filepath, overwrite=True)
File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\keras\models.py", line 754, in save_weights
with h5py.File(filepath, 'w') as f:
File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\h5py\_hl\files.py", line 269, in __init__
fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
File "C:\ProgramData\Miniconda3\envs\tflow_g\lib\site-packages\h5py\_hl\files.py", line 105, in make_fid
fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py\h5f.pyx", line 98, in h5py.h5f.create
OSError: Unable to create file (unable to open file: name = 'output/TB_runs_03122018-031837/dump/models/basic_ff_nn4_mse_dr50_Nadam_LeakyReLU_kr_l2_ar_off_ns_0_BCtoA_all_2_2/best_weights.hdf5', errno = 22, error message = 'Invalid argument', flags = 13, o_flags = 302)
"""
I know this post is rather old, but I just ran into this problem myself and managed to find a solution. Perhaps it is also useful for others.
This error occurs because the file has been locked. You can turn HDF5 file locking off by typing into the terminal:
export HDF5_USE_FILE_LOCKING=FALSE
see more explanation on NetCDF / HDF5 file locking here.
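The same thing can be done from Python, as long as the variable is set before the HDF5 library touches the file; setting it before importing h5py (or netCDF4) is the safest assumption:

import os

# Must be set before the HDF5 library opens or creates the file.
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

import h5py  # imported only after the variable is set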
I'm using TensorFlow in Python to train a model to recognise images, but I'm getting the error below when trying to execute train.py from GitHub.
Traceback (most recent call last):
  File "train.py", line 1023, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "C:\Users\sande\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "train.py", line 766, in main
    bottleneck_tensor)
  File "train.py", line 393, in cache_bottlenecks
    jpeg_data_tensor, bottleneck_tensor)
  File "train.py", line 341, in get_or_create_bottleneck
    bottleneck_tensor)
  File "train.py", line 290, in create_bottleneck_file
    print('Creating bottleneck at ' + bottleneck_path)
OSError: raw write() returned invalid length 112 (should have been between 0 and 56)
Below is the code for create_bottleneck_file()
def create_bottleneck_file(bottleneck_path, image_lists, label_name, index,
                           image_dir, category, sess, jpeg_data_tensor,
                           bottleneck_tensor):
    """Create a single bottleneck file."""
    print('Creating bottleneck at ' + bottleneck_path)
    image_path = get_image_path(image_lists, label_name, index,
                                image_dir, category)
    if not gfile.Exists(image_path):
        tf.logging.fatal('File does not exist %s', image_path)
    image_data = gfile.FastGFile(image_path, 'rb').read()
    try:
        bottleneck_values = run_bottleneck_on_image(
            sess, image_data, jpeg_data_tensor, bottleneck_tensor)
    except:
        raise RuntimeError('Error during processing file %s' % image_path)
    bottleneck_string = ','.join(str(x) for x in bottleneck_values)
    with open(bottleneck_path, 'w') as bottleneck_file:
        bottleneck_file.write(bottleneck_string)
I tried reducing the file names so that bottleneck_path would be a small value, but that did not work. I searched online for this error but did not find anything useful. Please let me know if you have a fix for this issue.
If you're unable to migrate to Python 3.6 or to move away from Windows, like me, install the win_unicode_console package, import it, and add this line at the beginning of your script to enable it:
win_unicode_console.enable()
This issue appears to be largely unique to pre-3.6 Python, as the code responsible for handling text output was rewritten for that version. This also means that we will most likely not see a fix for it in older versions.
Source: https://bugs.python.org/issue32245
I think this is a bug in the stdout/stderr streams introduced by the November Creators Update; it happens in both powershell.exe and cmd.exe.
It seems to only happen on Windows 10 Version 1709 (OS Build 16299.64). My guess is that it is Unicode related (the output size is twice the expected length).
A (very) quick and dirty fix is to output only ASCII on your console:
mystring.encode("utf-8").decode("ascii")
https://github.com/Microsoft/vscode/issues/39149#issuecomment-347260954
Adding to AMSAntiago's answer: you could run win_unicode_console.enable(), but instead of adding it to every file, you could run it on every Python invocation (see the docs). That works for me.
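For example (a sketch; whether usercustomize.py is picked up depends on how the site module is configured in your environment):

# Contents of usercustomize.py (or the very top of your own script),
# so the fix applies to every Python invocation without editing each file.
import win_unicode_console
win_unicode_console.enable()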
I have a result file from Dymola (.mat v4) which stores all variables in a huge 1D array (more or less 2 GB of data in one array). I can't do anything about the file format, as we are bound to use Dymola. When trying to read the file using scipy (with Python 2.7.13 64-bit), I get the following error:
C:\Users\...\scipy\io\matlab\mio4.py:352: RuntimeWarning: overflow encountered in long_scalars
  remaining_bytes = hdr.dtype.itemsize * n
C:\...\scipy\io\matlab\mio4.py:172: RuntimeWarning: overflow encountered in long_scalars
  num_bytes *= d
Traceback (most recent call last):
  File ...
    self.mat = scipy.io.loadmat(fileName, chars_as_strings=False)
  File "C:\...\scipy\io\matlab\mio.py", line 136, in loadmat
    matfile_dict = MR.get_variables(variable_names)
  File "C:\...\scipy\io\matlab\mio4.py", line 399, in get_variables
    mdict[name] = self.read_var_array(hdr)
  File "C:\...\scipy\io\matlab\mio4.py", line 374, in read_var_array
    return self._matrix_reader.array_from_header(header, process)
  File "C:\...\scipy\io\matlab\mio4.py", line 137, in array_from_header
    arr = self.read_full_array(hdr)
  File "C:\...\scipy\io\matlab\mio4.py", line 207, in read_full_array
    return self.read_sub_array(hdr)
  File "C:\...\scipy\io\matlab\mio4.py", line 178, in read_sub_array
    "`variable_names` kwarg to `loadmat`" % hdr.name)
ValueError: Not enough bytes to read matrix 'data_2'; is this a badly-formed file? Consider listing matrices with `whosmat` and loading named matrices with `variable_names` kwarg to `loadmat`
The error/problem is pretty clear to me. My question: Are there any workarounds? Can I still read the file and get the data? Is it possible to split the array while reading it?
I suggest you turn on conversion to the SDF file format, which is based on HDF5 and can better handle large files. See Simulation/Setup.
Alternatively, you can reduce the number of variables stored in the file using Variable Selections in Dymola.
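If you need to stay with the v4 .mat file, here is a sketch of the route the error message itself points at: list the matrices first with whosmat, then load only the ones you need via variable_names. The file name and variable names below are assumptions for illustration.

import scipy.io

file_name = "result.mat"  # assumption: path to the Dymola v4 result file

# List (name, shape, dtype) for every matrix without loading the data.
print(scipy.io.whosmat(file_name))

# Load only selected matrices instead of the whole file.
subset = scipy.io.loadmat(file_name, chars_as_strings=False,
                          variable_names=["name", "dataInfo"])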
Running into a very strange issue when trying to create a rather large numpy ndarray dataset.
e.g.
import h5py
import numpy as np
test_h5=h5py.File('test.hdf5','w')
n=3055693983 # fail
n=10000000000 # works
n=40000000000 # fail
n=100000000000 # works
n=20000000000 #fail
n=512 # works
test_h5.create_dataset('matrix', shape=(n,n), dtype=np.int8, compression='gzip', chunks=(256,256))
print(test_h5['matrix'].shape)
a=test_h5['matrix']
a[0:256,0:256]=np.ones((256,256))
Chunk size is (256,256).
If the above ndarray is set to (512,512), everything works AOK.
If the above ndarray is set to (100000000000,100000000000), everything works AOK...
Ideally I wanted an ndarray of size (3055693983, 3055693983), which fails with the following:
(3055693983, 3055693983)
Traceback (most recent call last):
  File "h5.py", line 16, in <module>
    a[0:256,0:256]=np.ones((256,256))
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/work/h5py/_objects.c:2696)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/work/h5py/_objects.c:2654)
  File "/home/user/anaconda2/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 618, in __setitem__
    self.id.write(mspace, fspace, val, mtype, dxpl=self._dxpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/work/h5py/_objects.c:2696)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/work/h5py/_objects.c:2654)
  File "h5py/h5d.pyx", line 221, in h5py.h5d.DatasetID.write (/home/ilan/minonda/conda-bld/work/h5py/h5d.c:3527)
  File "h5py/_proxy.pyx", line 132, in h5py._proxy.dset_rw (/home/ilan/minonda/conda-bld/work/h5py/_proxy.c:1889)
  File "h5py/_proxy.pyx", line 93, in h5py._proxy.H5PY_H5Dwrite (/home/ilan/minonda/conda-bld/work/h5py/_proxy.c:1599)
IOError: Can't prepare for writing data (Can't retrieve number of elements in file dataset)
Setting the ndarray to a few random sizes produced mixed results: some work, some do not. I thought it might be something simple, like the ndarray size not being evenly divisible by the chunk size, but that does not appear to be the issue.
What am I missing here?
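For reference, a bit of plain arithmetic (nothing more than that) on where the tried shapes land relative to the 256 x 256 chunking:

import math

# Element counts and chunk counts for the shapes tried above.
for n in (512, 3055693983, 10000000000, 20000000000,
          40000000000, 100000000000):
    chunks_per_dim = math.ceil(n / 256)
    print(n, 'elements:', n * n, 'chunks:', chunks_per_dim ** 2)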