Train, test and validation generators - Python

I've been working on my model and have come to a point where I need training, test and validation generators like this:
from keras.preprocessing.image import ImageDataGenerator

def get_train_generator(df, image_dir, x_col, y_cols, shuffle=True, batch_size=8, seed=1, target_w=320, target_h=320):
    """
    Return generator for training set, normalizing using batch
    statistics.

    Args:
      df (dataframe): dataframe specifying training data.
      image_dir (str): directory where image files are held.
      x_col (str): name of column in df that holds filenames.
      y_cols (list): list of strings that hold y labels for images.
      batch_size (int): images per batch to be fed into model during training.
      seed (int): random seed.
      target_w (int): final width of input images.
      target_h (int): final height of input images.

    Returns:
      train_generator (DataFrameIterator): iterator over training set
    """
    print("getting train generator...")
    # normalize each image using its own mean and standard deviation
    image_generator = ImageDataGenerator(
        samplewise_center=True,
        samplewise_std_normalization=True)

    # flow from dataframe with the specified batch size
    # and target image size
    generator = image_generator.flow_from_dataframe(
        dataframe=df,
        directory=image_dir,
        x_col=x_col,
        y_col=y_cols,
        class_mode="raw",
        batch_size=batch_size,
        shuffle=shuffle,
        seed=seed,
        target_size=(target_w, target_h))
    return generator
This works quite well! But when I get to the test and validation generators:
def get_test_and_valid_generator(valid_df, test_df, train_df, image_dir, x_col, y_cols, sample_size=100, batch_size=8, seed=1, target_w=320, target_h=320):
    """
    Return generator for validation set and test set using
    normalization statistics from training set.

    Args:
      valid_df (dataframe): dataframe specifying validation data.
      test_df (dataframe): dataframe specifying test data.
      train_df (dataframe): dataframe specifying training data.
      image_dir (str): directory where image files are held.
      x_col (str): name of column in df that holds filenames.
      y_cols (list): list of strings that hold y labels for images.
      sample_size (int): size of sample to use for normalization statistics.
      batch_size (int): images per batch to be fed into model during training.
      seed (int): random seed.
      target_w (int): final width of input images.
      target_h (int): final height of input images.

    Returns:
      test_generator (DataFrameIterator) and valid_generator: iterators over test set and validation set respectively
    """
    print("getting test and valid generators...")
    # get generator to sample the training set
    raw_train_generator = ImageDataGenerator().flow_from_dataframe(
        dataframe=train_df,
        directory=IMAGE_DIR,
        x_col="Image",
        y_col=labels,
        class_mode="raw",
        batch_size=sample_size,
        shuffle=True,
        target_size=(target_w, target_h))

    # get data sample
    batch = raw_train_generator.next()
    data_sample = batch[0]

    # use sample to fit mean and std for the test and valid set generators
    image_generator = ImageDataGenerator(
        featurewise_center=True,
        featurewise_std_normalization=True)

    # fit generator to sample from training data
    image_generator.fit(data_sample)

    # get valid generator
    valid_generator = image_generator.flow_from_dataframe(
        dataframe=valid_df,
        directory=image_dir,
        x_col=x_col,
        y_col=y_cols,
        class_mode="raw",
        batch_size=batch_size,
        shuffle=False,
        seed=seed,
        target_size=(target_w, target_h))

    # get test generator
    test_generator = image_generator.flow_from_dataframe(
        dataframe=test_df,
        directory=image_dir,
        x_col=x_col,
        y_col=y_cols,
        class_mode="raw",
        batch_size=batch_size,
        shuffle=False,
        seed=seed,
        target_size=(target_w, target_h))
    return valid_generator, test_generator
So when I enter my data as:
IMAGE_DIR = "/Users/awabe/Desktop/Project/PapilaDB/Image"
train_generator = get_train_generator(train_df, IMAGE_DIR, "Image", labels)
it gives me:
getting train generator...
Found 488 validated image filenames.
and this is also fine. But in the test and validation section, running:
IMAGE_DIR = "/Users/awabe/Desktop/Project/PapilaDB/Image"
train_generator = get_train_generator(train_df, IMAGE_DIR, "Image", labels)
valid_generator, test_generator= get_test_and_valid_generator(valid_df, test_df, train_df, IMAGE_DIR, "Image", labels)
the error is:
KeyError Traceback (most recent call last)
File /opt/anaconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/indexes/base.py:3803, in Index.get_loc(self, key, method, tolerance)
3802 try:
-> 3803 return self._engine.get_loc(casted_key)
3804 except KeyError as err:
File /opt/anaconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()
File /opt/anaconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/_libs/index.pyx:146, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/index_class_helper.pxi:49, in pandas._libs.index.Int64Engine._check_type()
KeyError: 'Image'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[201], line 4
1 IMAGE_DIR = "/Users/awabe/Desktop/Project/PapilaDB/Image"
3 train_generator = get_train_generator(train_df, IMAGE_DIR, "Image", labels)
----> 4 valid_generator, test_generator= get_test_and_valid_generator(valid_df, test_df, train_df, IMAGE_DIR, "Image", labels)
Cell In[194], line 47, in get_test_and_valid_generator(valid_df, test_df, train_df, image_dir, x_col, y_cols, sample_size, batch_size, seed, target_w, target_h)
44 image_generator.fit(data_sample)
46 # get test generator
---> 47 valid_generator = image_generator.flow_from_dataframe(
48 dataframe=valid_df,
49 directory=image_dir,
50 x_col=x_col,
51 y_col=y_cols,
52 class_mode="raw",
53 batch_size=batch_size,
54 shuffle=False,
55 seed=seed,
56 target_size=(target_w,target_h))
58 test_generator = image_generator.flow_from_dataframe(
59 dataframe=test_df,
60 directory=image_dir,
(...)
66 seed=seed,
67 target_size=(target_w,target_h))
68 return valid_generator, test_generator
File /opt/anaconda3/envs/tensorflow/lib/python3.10/site-packages/keras/preprocessing/image.py:1808, in ImageDataGenerator.flow_from_dataframe(self, dataframe, directory, x_col, y_col, weight_col, target_size, color_mode, classes, class_mode, batch_size, shuffle, seed, save_to_dir, save_prefix, save_format, subset, interpolation, validate_filenames, **kwargs)
1801 if "drop_duplicates" in kwargs:
1802 warnings.warn(
1803 "drop_duplicates is deprecated, you can drop duplicates "
1804 "by using the pandas.DataFrame.drop_duplicates method.",
1805 DeprecationWarning,
1806 )
-> 1808 return DataFrameIterator(
1809 dataframe,
1810 directory,
1811 self,
1812 x_col=x_col,
1813 y_col=y_col,
1814 weight_col=weight_col,
1815 target_size=target_size,
1816 color_mode=color_mode,
1817 classes=classes,
1818 class_mode=class_mode,
1819 data_format=self.data_format,
1820 batch_size=batch_size,
1821 shuffle=shuffle,
1822 seed=seed,
1823 save_to_dir=save_to_dir,
1824 save_prefix=save_prefix,
1825 save_format=save_format,
1826 subset=subset,
1827 interpolation=interpolation,
1828 validate_filenames=validate_filenames,
1829 dtype=self.dtype,
1830 )
File /opt/anaconda3/envs/tensorflow/lib/python3.10/site-packages/keras/preprocessing/image.py:968, in DataFrameIterator.__init__(self, dataframe, directory, image_data_generator, x_col, y_col, weight_col, target_size, color_mode, classes, class_mode, batch_size, shuffle, seed, data_format, save_to_dir, save_prefix, save_format, subset, interpolation, keep_aspect_ratio, dtype, validate_filenames)
966 self.dtype = dtype
967 # check that inputs match the required class_mode
--> 968 self._check_params(df, x_col, y_col, weight_col, classes)
969 if (
970 validate_filenames
971 ): # check which image files are valid and keep them
972 df = self._filter_valid_filepaths(df, x_col)
File /opt/anaconda3/envs/tensorflow/lib/python3.10/site-packages/keras/preprocessing/image.py:1029, in DataFrameIterator._check_params(self, df, x_col, y_col, weight_col, classes)
1023 raise TypeError(
1024 'If class_mode="{}", y_col must be a list. Received {}.'.format(
1025 self.class_mode, type(y_col).__name__
1026 )
1027 )
1028 # check that filenames/filepaths column values are all strings
-> 1029 if not all(df[x_col].apply(lambda x: isinstance(x, str))):
1030 raise TypeError(
1031 "All values in column x_col={} must be strings.".format(x_col)
1032 )
1033 # check labels are string if class_mode is binary or sparse
File /opt/anaconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/frame.py:3805, in DataFrame.__getitem__(self, key)
3803 if self.columns.nlevels > 1:
3804 return self._getitem_multilevel(key)
-> 3805 indexer = self.columns.get_loc(key)
3806 if is_integer(indexer):
3807 indexer = [indexer]
File /opt/anaconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key, method, tolerance)
3803 return self._engine.get_loc(casted_key)
3804 except KeyError as err:
-> 3805 raise KeyError(key) from err
3806 except TypeError:
3807 # If we have a listlike key, _check_indexing_error will raise
3808 # InvalidIndexError. Otherwise we fall through and re-raise
3809 # the TypeError.
3810 self._check_indexing_error(key)
KeyError: 'Image'
So what can I do? I've tried different approaches but nothing has worked.
Is there another way to generate these?
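One thing worth checking (an assumption on my part, not something confirmed in the post): the traceback goes through pandas' Int64Engine before raising KeyError: 'Image', which usually means the dataframe's column index holds integers rather than names, so valid_df (and possibly test_df) may simply not have an "Image" column. A minimal diagnostic sketch, reusing the dataframe and column names from the post:

# check that every dataframe exposes the columns the generators expect
print(train_df.columns.tolist())
print(valid_df.columns.tolist())
print(test_df.columns.tolist())

# if valid_df/test_df were built without headers, their columns will be 0, 1, 2, ...
# in that case one possible fix (assuming the layouts match) is to copy the names over:
# valid_df.columns = train_df.columns
# test_df.columns = train_df.columns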

Related

KeyError - Using GroupShuffleSplit to generate training and validation sets

I am trying to use a custom CNN to classify spectrogram images generated for 3s audio segments. I am using GroupShuffleSplit to divide the training dataset into a training set and a validation set, and to ensure that each participant is only included in one set (to prevent data leakage). I am using a custom Dataset class to load the audio files and augment the training set.
If I generate a SoundDS object using the training set, randomly split the object into a training and validation subset, and pass these subsets into two respective data loaders, my model trains without any issues.
However, if I initially use GroupShuffleSplit on the dataframe train_df (grouping by the column addressfname), then generate two SoundDS objects and pass the train_DS and validation_DS into two respective data loaders, I encounter the error message shown at the bottom of this post when I try to run for i, data in enumerate(train_dl) in my train function. Does anyone know where the issue lies?
from torch.utils.data import DataLoader, Dataset, random_split
import torchaudio

# ----------------------------
# Sound Dataset
# ----------------------------
class SoundDS(Dataset):
    def __init__(self, df):
        self.df = df
        self.duration = 3000
        self.sr = 44100
        self.channel = 2
        self.shift_pct = 0.4

    # ----------------------------
    # Number of items in dataset
    # ----------------------------
    def __len__(self):
        return len(self.df)

    # ----------------------------
    # Get i'th item in dataset
    # ----------------------------
    def __getitem__(self, idx):
        audio_file = self.df.loc[idx, "relative_path"]
        class_id = self.df.loc[idx, "dx"]
        aud = AudioUtil.open(audio_file)
        reaud = AudioUtil.resample(aud, self.sr)
        rechan = AudioUtil.rechannel(reaud, self.channel)
        dur_aud = AudioUtil.pad_trunc(rechan, self.duration)
        shift_aud = AudioUtil.time_shift(dur_aud, self.shift_pct)
        sgram = AudioUtil.spectro_gram(shift_aud, n_mels=64, n_fft=1024, hop_len=None)
        aug_sgram = AudioUtil.spectro_augment(sgram, max_mask_pct=0.1, n_freq_masks=2, n_time_masks=2)
        return aug_sgram, class_id
This method, which randomly splits the training set into training and validation subsets, works.
from torch.utils.data import random_split
myds = SoundDS(train_df)
# Random split of 80:20 between training and validation
num_items = len(myds)
num_train = round(num_items * 0.8)
num_val = num_items - num_train
train_ds, val_ds = random_split(myds, [num_train, num_val])
# Create training and validation data loaders
train_dl = torch.utils.data.DataLoader(train_ds, batch_size=16, shuffle=True)
val_dl = torch.utils.data.DataLoader(val_ds, batch_size=16, shuffle=False)
This method, which uses GroupShuffleSplit to split the training set into training and validation subsets and then generates two SoundDS objects, does not work.
from sklearn.model_selection import GroupShuffleSplit
splitter = GroupShuffleSplit(test_size=0.15, n_splits=1, random_state = 7)
split = splitter.split(train_df, groups=train_df['adressfname'])
train_inds, valid_inds = next(split)
train_data_df = train_df.iloc[train_inds]
valid_data_df = train_df.iloc[valid_inds]
train_dataset_DS = SoundDS(train_data_df)
valid_dataset_DS = SoundDS(valid_data_df)
train_dl = torch.utils.data.DataLoader(train_dataset_DS, batch_size=16, shuffle=True)
val_dl = torch.utils.data.DataLoader(valid_dataset_DS, batch_size=16, shuffle=False)
Function to train the model
def training(model, train_dl, num_epochs):
    # Loss Function, Optimizer and Scheduler
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.001,
                                                    steps_per_epoch=int(len(train_dl)),
                                                    epochs=num_epochs,
                                                    anneal_strategy='linear')
    # Repeat for each epoch
    for epoch in range(num_epochs):
        running_loss = 0.0
        correct_prediction = 0
        total_prediction = 0
        # Repeat for each batch in the training set
        for i, data in enumerate(train_dl):  # ---- This is where the issue occurs
            # Get the input features and target labels, and put them on the GPU
            inputs, labels = data[0].to(device), data[1].to(device)
            # Normalize the inputs
            inputs_m, inputs_s = inputs.mean(), inputs.std()
            inputs = (inputs - inputs_m) / inputs_s
This is the error message:
KeyError
Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
15 frames
/usr/local/lib/python3.8/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
/usr/local/lib/python3.8/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 3170
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-201-b60077dcefb4> in <module>
54
55 num_epochs=2 # Just for demo, adjust this higher.
---> 56 training(myModel, train_dl, num_epochs)
<ipython-input-201-b60077dcefb4> in training(model, train_dl, num_epochs)
15
16 # Repeat for each batch in the training set
---> 17 for i, data in enumerate(train_dl):
18 # Get the input features and target labels, and put them on the GPU
19 inputs, labels = data[0].to(device), data[1].to(device)
/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py in __next__(self)
626 # TODO(https://github.com/pytorch/pytorch/issues/76750)
627 self._reset() # type: ignore[call-arg]
--> 628 data = self._next_data()
629 self._num_yielded += 1
630 if self._dataset_kind == _DatasetKind.Iterable and \
/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
669 def _next_data(self):
670 index = self._next_index() # may raise StopIteration
--> 671 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
672 if self._pin_memory:
673 data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)
/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
56 data = self.dataset.__getitems__(possibly_batched_index)
57 else:
---> 58 data = [self.dataset[idx] for idx in possibly_batched_index]
59 else:
60 data = self.dataset[possibly_batched_index]
/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
56 data = self.dataset.__getitems__(possibly_batched_index)
57 else:
---> 58 data = [self.dataset[idx] for idx in possibly_batched_index]
59 else:
60 data = self.dataset[possibly_batched_index]
<ipython-input-191-4c84224a9983> in __getitem__(self, idx)
27
28 # print(self.df.loc[idx, 'relative_path'])
---> 29 audio_file = self.df.loc[idx, "relative_path"]
30 class_id = self.df.loc[idx, "dx"]
31 # participant_id = self.df.loc[idx, 'adressfname']
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py in __getitem__(self, key)
923 with suppress(KeyError, IndexError):
924 return self.obj._get_value(*key, takeable=self._takeable)
--> 925 return self._getitem_tuple(key)
926 else:
927 # we by definition only have the 0th axis
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
1098 def _getitem_tuple(self, tup: tuple):
1099 with suppress(IndexingError):
-> 1100 return self._getitem_lowerdim(tup)
1101
1102 # no multi-index, so validate all of the indexers
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py in _getitem_lowerdim(self, tup)
836 # We don't need to check for tuples here because those are
837 # caught by the _is_nested_tuple_indexer check above.
--> 838 section = self._getitem_axis(key, axis=i)
839
840 # We should never have a scalar section here, because
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
1162 # fall thru to straight lookup
1163 self._validate_key(key, axis)
-> 1164 return self._get_label(key, axis=axis)
1165
1166 def _get_slice_axis(self, slice_obj: slice, axis: int):
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py in _get_label(self, label, axis)
1111 def _get_label(self, label, axis: int):
1112 # GH#5667 this will fail if the label is not present in the axis.
-> 1113 return self.obj.xs(label, axis=axis)
1114
1115 def _handle_lowerdim_multi_index_axis0(self, tup: tuple):
/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py in xs(self, key, axis, level, drop_level)
3774 raise TypeError(f"Expected label or tuple of labels, got {key}") from e
3775 else:
-> 3776 loc = index.get_loc(key)
3777
3778 if isinstance(loc, np.ndarray):
/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: 3170
Does anyone have any idea why one approach works without any issues, but the other does not?
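One detail worth checking (an assumption based on the traceback, not something stated in the post): train_df.iloc[train_inds] keeps the original row labels, so after GroupShuffleSplit the DataLoader's positional indices (0 to len-1) no longer line up with what self.df.loc[idx, ...] looks up, which would explain KeyError: 3170. random_split does not hit this because the original train_df presumably still has a default 0..N-1 index. A minimal sketch of a fix is to reset the index on each split:

from sklearn.model_selection import GroupShuffleSplit

splitter = GroupShuffleSplit(test_size=0.15, n_splits=1, random_state=7)
split = splitter.split(train_df, groups=train_df['adressfname'])
train_inds, valid_inds = next(split)

# reset_index(drop=True) relabels the rows 0..len-1 so .loc[idx, ...] matches the DataLoader indices
train_data_df = train_df.iloc[train_inds].reset_index(drop=True)
valid_data_df = train_df.iloc[valid_inds].reset_index(drop=True)

train_dataset_DS = SoundDS(train_data_df)
valid_dataset_DS = SoundDS(valid_data_df)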

How do I solve a KeyError: 8 while using PyTorch?

from torch.utils.data import (TensorDataset, DataLoader, RandomSampler,
                              SequentialSampler)

def data_loader(train_inputs, val_inputs, train_labels, val_labels, batch_size=50):
    """
    Convert train and validation sets to torch.Tensors and load them to DataLoader.
    """
    # Convert data type to torch.Tensor
    train_inputs, val_inputs, train_labels, val_labels = \
        tuple(torch.tensor(data) for data in
              [train_inputs, val_inputs, train_labels, val_labels])

    # Specify batch_size
    batch_size = 50

    # Create DataLoader for training data
    train_data = TensorDataset(train_inputs, train_labels)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler,
                                  batch_size=batch_size)

    # Create DataLoader for validation data
    val_data = TensorDataset(val_inputs, val_labels)
    val_sampler = SequentialSampler(val_data)
    val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)

    return train_dataloader, val_dataloader
The code works fine when the train_inputs and val_inputs tensors are of type int64, but doesn't when the type is int32.
Can someone tell me what's wrong here?
ERROR:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py:3621, in Index.get_loc(self, key, method, tolerance)
3620 try:
-> 3621 return self._engine.get_loc(casted_key)
3622 except KeyError as err:
File ~\Anaconda3\lib\site-packages\pandas\_libs\index.pyx:136, in pandas._libs.index.IndexEngine.get_loc()
File ~\Anaconda3\lib\site-packages\pandas\_libs\index.pyx:163, in pandas._libs.index.IndexEngine.get_loc()
File pandas\_libs\hashtable_class_helper.pxi:2131, in pandas._libs.hashtable.Int64HashTable.get_item()
File pandas\_libs\hashtable_class_helper.pxi:2140, in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 8
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Input In [31], in <cell line: 6>()
2 train_inputs, val_inputs, train_labels, val_labels = train_test_split(
3 input_ids, labels, test_size=0.1, random_state=42)
5 # Load data to PyTorch DataLoader
----> 6 train_dataloader, val_dataloader = data_loader(train_inputs, val_inputs, train_labels, val_labels, batch_size=50)
Input In [28], in data_loader(train_inputs, val_inputs, train_labels, val_labels, batch_size)
6 """Convert train and validation sets to torch.Tensors and load them to
7 DataLoader.
8 """
10 # Convert data type to torch.Tensor
11 train_inputs, val_inputs, train_labels, val_labels =\
---> 12 tuple(torch.tensor(data) for data in
13 [train_inputs, val_inputs, train_labels, val_labels])
15 # Specify batch_size
16 batch_size = 50
Input In [28], in <genexpr>(.0)
6 """Convert train and validation sets to torch.Tensors and load them to
7 DataLoader.
8 """
10 # Convert data type to torch.Tensor
11 train_inputs, val_inputs, train_labels, val_labels =\
---> 12 tuple(torch.tensor(data) for data in
13 [train_inputs, val_inputs, train_labels, val_labels])
15 # Specify batch_size
16 batch_size = 50
File ~\Anaconda3\lib\site-packages\pandas\core\series.py:958, in Series.__getitem__(self, key)
955 return self._values[key]
957 elif key_is_scalar:
--> 958 return self._get_value(key)
960 if is_hashable(key):
961 # Otherwise index.get_value will raise InvalidIndexError
962 try:
963 # For labels that don't resolve as scalars like tuples and frozensets
File ~\Anaconda3\lib\site-packages\pandas\core\series.py:1069, in Series._get_value(self, label, takeable)
1066 return self._values[label]
1068 # Similar to Index.get_value, but we do not fall back to positional
-> 1069 loc = self.index.get_loc(label)
1070 return self.index._get_values_for_loc(self, loc, label)
File ~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py:3623, in Index.get_loc(self, key, method, tolerance)
3621 return self._engine.get_loc(casted_key)
3622 except KeyError as err:
-> 3623 raise KeyError(key) from err
3624 except TypeError:
3625 # If we have a listlike key, _check_indexing_error will raise
3626 # InvalidIndexError. Otherwise we fall through and re-raise
3627 # the TypeError.
3628 self._check_indexing_error(key)
KeyError: 8
I was using the same code on my dataset and had the same issue. I did two things: I changed the random_state so it was no longer 42 (which probably wasn't what fixed it), and I converted my labels to an np.array, and now it works.
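A minimal sketch of that conversion, assuming the inputs and labels come out of train_test_split as pandas objects (the names follow the question):

import numpy as np

# torch.tensor() can trip over a pandas Series whose index no longer starts at 0,
# so convert everything to plain numpy arrays before building the tensors
train_inputs = np.asarray(train_inputs)
val_inputs = np.asarray(val_inputs)
train_labels = np.asarray(train_labels)
val_labels = np.asarray(val_labels)

train_dataloader, val_dataloader = data_loader(
    train_inputs, val_inputs, train_labels, val_labels, batch_size=50)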

My model.fit and model.evaluate are not working properly and I am getting an error

train_x = train['text']
valid_x = valid["text"]
train_y = train["label"]
valid_y = valid["label"]
train_y = train_y.values.reshape(-1,1)
vectorizer = TfidfVectorizer()
vectorizer.fit_transform(train_x)
x_train_count = vectorizer.fit_transform(train_x)
x_valid_count = vectorizer.fit_transform(valid_x)
x_test_count = vectorizer.fit_transform(test["text"])
model = tf.keras.Sequential()
model.add(Dense(50,input_dim=x_train_count.shape[1], kernel_initializer="uniform", activation="relu"))
model.add(Dense(1, kernel_initializer="uniform", activation="sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
# # Fit the model
history = model.fit(x_train_count, train_y, validation_data=(x_valid_count,valid_y), epochs=3, batch_size=128)
loss, acc = model.evaluate(x_test_count, test["label"], verbose=0)
Error
--------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
<ipython-input-1-be49478b4c50> in <module>
71 model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
72 # # Fit the model
---> 73 history = model.fit(x_train_count, train_y, validation_data=(x_valid_count,valid_y), epochs=3, batch_size=128)
74
75 loss, acc = model.evaluate(x_test_count, test["label"], verbose=0)
~\anaconda3\lib\site-packages\tensorflow\python\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
1048 training_utils.RespectCompiledTrainableState(self):
1049 # Creates a `tf.data.Dataset` and handles batch and epoch iteration.
-> 1050 data_handler = data_adapter.DataHandler(
1051 x=x,
1052 y=y,
~\anaconda3\lib\site-packages\tensorflow\python\keras\engine\data_adapter.py in __init__(self, x, y, sample_weight, batch_size, steps_per_epoch, initial_epoch, epochs, shuffle, class_weight, max_queue_size, workers, use_multiprocessing, model, steps_per_execution)
1098
1099 adapter_cls = select_data_adapter(x, y)
-> 1100 self._adapter = adapter_cls(
1101 x,
1102 y,
~\anaconda3\lib\site-packages\tensorflow\python\keras\engine\data_adapter.py in __init__(self, x, y, sample_weights, sample_weight_modes, batch_size, steps, shuffle, **kwargs)
564 inputs = pack_x_y_sample_weight(x, y, sample_weights)
565
--> 566 dataset = dataset_ops.DatasetV2.from_tensor_slices(inputs)
567 num_samples = int(nest.flatten(x)[0].shape[0])
568 if shuffle:
~\anaconda3\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py in from_tensor_slices(tensors)
689 Dataset: A `Dataset`.
690 """
--> 691 return TensorSliceDataset(tensors)
692
693 class _GeneratorState(object):
~\anaconda3\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py in __init__(self, element)
3155 element = structure.normalize_element(element)
3156 batched_spec = structure.type_spec_from_value(element)
-> 3157 self._tensors = structure.to_batched_tensor_list(batched_spec, element)
3158 self._structure = nest.map_structure(
3159 lambda component_spec: component_spec._unbatch(), batched_spec) # pylint: disable=protected-access
~\anaconda3\lib\site-packages\tensorflow\python\data\util\structure.py in to_batched_tensor_list(element_spec, element)
362 # pylint: disable=protected-access
363 # pylint: disable=g-long-lambda
--> 364 return _to_tensor_list_helper(
365 lambda state, spec, component: state + spec._to_batched_tensor_list(
366 component), element_spec, element)
~\anaconda3\lib\site-packages\tensorflow\python\data\util\structure.py in _to_tensor_list_helper(encode_fn, element_spec, element)
337 return encode_fn(state, spec, component)
338
--> 339 return functools.reduce(
340 reduce_fn, zip(nest.flatten(element_spec), nest.flatten(element)), [])
341
~\anaconda3\lib\site-packages\tensorflow\python\data\util\structure.py in reduce_fn(state, value)
335 def reduce_fn(state, value):
336 spec, component = value
--> 337 return encode_fn(state, spec, component)
338
339 return functools.reduce(
~\anaconda3\lib\site-packages\tensorflow\python\data\util\structure.py in <lambda>(state, spec, component)
363 # pylint: disable=g-long-lambda
364 return _to_tensor_list_helper(
--> 365 lambda state, spec, component: state + spec._to_batched_tensor_list(
366 component), element_spec, element)
367
~\anaconda3\lib\site-packages\tensorflow\python\framework\sparse_tensor.py in _to_batched_tensor_list(self, value)
367 raise ValueError(
368 "Unbatching a sparse tensor is only supported for rank >= 1")
--> 369 return [gen_sparse_ops.serialize_many_sparse(
370 value.indices, value.values, value.dense_shape,
371 out_type=dtypes.variant)]
~\anaconda3\lib\site-packages\tensorflow\python\ops\gen_sparse_ops.py in serialize_many_sparse(sparse_indices, sparse_values, sparse_shape, out_type, name)
493 return _result
494 except _core._NotOkStatusException as e:
--> 495 _ops.raise_from_not_ok_status(e, name)
496 except _core._FallbackException:
497 pass
~\anaconda3\lib\site-packages\tensorflow\python\framework\ops.py in raise_from_not_ok_status(e, name)
6860 message = e.message + (" name: " + name if name is not None else "")
6861 # pylint: disable=protected-access
-> 6862 six.raise_from(core._status_to_exception(e.code, message), None)
6863 # pylint: enable=protected-access
6864
~\anaconda3\lib\site-packages\six.py in raise_from(value, from_value)
InvalidArgumentError: indices[1] = [0,40295] is out of order. Many sparse ops require sorted indices.
Use tf.sparse.reorder to create a correctly ordered copy.
The objects returned by the fit_transform() and transform() methods of TfidfVectorizer are compressed sparse row (CSR) matrices. To convert them into dense matrices, use the .toarray() method.
Furthermore, you should fit your TfidfVectorizer on the training set only and then use it to transform the validation and test sets without re-fitting it each time. Re-fitting would use data from the test set, which can introduce data leakage and yield overly optimistic performance. Also, since the test and validation sets are usually small, fitting a TF-IDF vectorizer solely on them results in poor vectorization. Finally, the IDF factor would differ for the same words across the training, validation and test sets, which is usually not desired.
For these reasons, I suggest fitting the TfidfVectorizer only on the training set and then using the fitted vectorizer to transform the validation and test sets.
Here is an example:
vectorizer = TfidfVectorizer()
x_train_count = vectorizer.fit_transform(train_x).toarray()
x_valid_count = vectorizer.transform(valid_x).toarray()
x_test_count = vectorizer.transform(test["text"]).toarray()
If your data do not fit into memory, you could convert to a dense matrix one batch at a time. Here is a function to create a batch generator from a sparse x_train matrix:
def sparse_matrix_batch_generator(X, y, batch_size=32):
    num_samples = np.shape(y)[0]
    shuffled_index = np.arange(num_samples)
    np.random.shuffle(shuffled_index)
    X = X[shuffled_index, :]
    y = y[shuffled_index]
    for n in range(num_samples//batch_size):
        batch_index = shuffled_index[n*batch_size:(n+1)*batch_size]
        X_batch = X[batch_index, :].todense()
        y_batch = y[batch_index]
        yield (X_batch, y_batch)
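A sketch of how such a generator might be consumed, assuming a tf.keras version whose fit() accepts plain Python generators (x_train_count here stays sparse, i.e. no .toarray(); the other names follow the snippets above):

batch_size = 32
steps = x_train_count.shape[0] // batch_size

# the generator yields a single pass over the data, so train one epoch per call
history = model.fit(
    sparse_matrix_batch_generator(x_train_count, train_y, batch_size),
    steps_per_epoch=steps,
    epochs=1)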

ValueError: features should be a dictionary of `Tensor`s. Given type: <class 'tensorflow.python.data.ops.dataset_ops.RepeatDataset'>

I am new to machine learning and am currently following a tutorial in a Jupyter notebook:
https://www.tensorflow.org/tutorials/estimator/linear
However, I keep getting this error and am unable to evaluate the accuracy. I googled for a solution and the suggestions point to an outdated TF version, but my TF is currently at version '2.0.0-alpha0' and I am using Python 3.7.4.
Thank you for your time!
CATEGORICAL_COLUMNS = ['sex', 'n_siblings_spouses', 'parch', 'class', 'deck',
                       'embark_town', 'alone']
NUMERIC_COLUMNS = ['age', 'fare']

feature_columns = []
for feature_name in CATEGORICAL_COLUMNS:
    vocabulary = dftrain[feature_name].unique()  # gets a list of all unique values from given feature column
    feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary))
for feature_name in NUMERIC_COLUMNS:
    feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float32))
# The cryptic lines of code inside the append() create an object that our model can use to map string values like "male" and "female" to integers.
# This allows us to avoid manually having to encode our dataframes.
# https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_vocabulary_list?version=stable

def make_input_fn(data_df, label_df, num_epochs=10, shuffle=True, batch_size=32):
    def input_function():  # inner function, this will be returned
        ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df))  # create tf.data.Dataset object with data and its label
        if shuffle:
            ds = ds.shuffle(1000)  # randomize order of data
        ds = ds.batch(batch_size).repeat(num_epochs)  # split dataset into batches of 32 and repeat process for number of epochs
        return ds  # return a batch of the dataset
    return input_function  # return a function object for use

train_input_fn = make_input_fn(dftrain, y_train)  # here we will call the input_function that was returned to us to get a dataset object we can feed to the model
eval_input_fn = make_input_fn(dfeval, y_eval, num_epochs=1, shuffle=False)

linear_est = tf.estimator.LinearClassifier(feature_columns=feature_columns)
# We create a linear estimator by passing the feature columns we created earlier

linear_est.train(train_input_fn)  # train
result = linear_est.evaluate(eval_input_fn)  # get model metrics/stats by testing on testing data

clear_output()  # clears console output
print(result['accuracy'])  # the result variable is simply a dict of stats about our model
INFO:tensorflow:Calling model_fn.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-27-1a944b4b2878> in <module>
----> 1 linear_est.train(train_input_fn) # train
2 result = linear_est.evaluate(eval_input_fn) # get model metrics/stats by testing on tetsing data
3
4 clear_output() # clears consoke output
5 print(result['accuracy']) # the result variable is simply a dict of stats about our model
C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
352
353 saving_listeners = _check_listeners_type(saving_listeners)
--> 354 loss = self._train_model(input_fn, hooks, saving_listeners)
355 logging.info('Loss for final step: %s.', loss)
356 return self
C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py in _train_model(self, input_fn, hooks, saving_listeners)
1181 return self._train_model_distributed(input_fn, hooks, saving_listeners)
1182 else:
-> 1183 return self._train_model_default(input_fn, hooks, saving_listeners)
1184
1185 def _train_model_default(self, input_fn, hooks, saving_listeners):
C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py in _train_model_default(self, input_fn, hooks, saving_listeners)
1211 worker_hooks.extend(input_hooks)
1212 estimator_spec = self._call_model_fn(
-> 1213 features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
1214 global_step_tensor = training_util.get_global_step(g)
1215 return self._train_with_estimator_spec(estimator_spec, worker_hooks,
C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py in _call_model_fn(self, features, labels, mode, config)
1169
1170 logging.info('Calling model_fn.')
-> 1171 model_fn_results = self._model_fn(features=features, **kwargs)
1172 logging.info('Done calling model_fn.')
1173
C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_estimator\python\estimator\canned\linear.py in _model_fn(features, labels, mode, config)
337 partitioner=partitioner,
338 config=config,
--> 339 sparse_combiner=sparse_combiner)
340
341 super(LinearClassifier, self).__init__(
C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_estimator\python\estimator\canned\linear.py in _linear_model_fn(features, labels, mode, head, feature_columns, optimizer, partitioner, config, sparse_combiner)
141 if not isinstance(features, dict):
142 raise ValueError('features should be a dictionary of `Tensor`s. '
--> 143 'Given type: {}'.format(type(features)))
144
145 optimizer = optimizers.get_optimizer_instance(
ValueError: features should be a dictionary of `Tensor`s. Given type: <class 'tensorflow.python.data.ops.dataset_ops.RepeatDataset'>
You haven't specified the feature_columns parameter correctly for the estimator.
See example here:
https://www.tensorflow.org/guide/data#tfestimator
You need to declare the feature names and data types like:
embark = tf.feature_column.categorical_column_with_hash_bucket('embark_town', 32)
cls = tf.feature_column.categorical_column_with_vocabulary_list('class', ['First', 'Second', 'Third'])
age = tf.feature_column.numeric_column('age')
and pass them in.

TensorFlow DNNClassifier: error while training (numpy.ndarray has no attribute index)

I am trying to train a DNNClassifier in TensorFlow. Here is my code:
train_input_fn = tf.estimator.inputs.pandas_input_fn(
    x=X_train,
    y=y_train,
    batch_size=1000,
    shuffle=True
)
nn_classifier = tf.estimator.DNNClassifier(hidden_units=[1300, 1300, 1300], feature_columns=X_train, n_classes=200)
nn_classifier.train(input_fn=train_input_fn, steps=2000)
Here is how y_train looks
[450 450 450 ... 327 327 327]
type : numpy.ndarray
And here is how X_train looks
[[ 9.79285 11.659035 1.279528 ... 1.258979 1.063923 -2.45522 ]
[ 8.711333 13.92955 1.117603 ... 3.588921 1.231256 -3.180302]
[ 5.159803 14.059619 1.740708 ... 0.28172 -0.506701 -1.326669]
...
[ 2.418473 0.542642 -3.658447 ... 4.631474 4.544892 -4.595605]
[ 6.51176 4.321688 -1.483697 ... 3.13299 5.476103 -2.833903]
[ 6.894113 5.986267 -1.178247 ... 2.305603 7.217919 -2.152574]]
type : numpy.ndarray
Error :
in pandas_input_fn(x, y, batch_size, num_epochs, shuffle, queue_capacity, num_threads, target_column)
85 'Cannot use name %s for target column: DataFrame already has a '
86 'column with that name: %s' % (target_column, x.columns))
---> 87 if not np.array_equal(x.index, y.index):
88 raise ValueError('Index for x and y are mismatched.\nIndex for x: %s\n'
89 'Index for y: %s\n' % (x.index, y.index))
Update 1: Using numpy_input_fn
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x=X_train,
    y=y_train,
    batch_size=1000,
    shuffle=True
)
Error:
INFO:tensorflow:Calling model_fn.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-23-3b7c6b879e38> in <module>()
10 start_time = time.time()
11 nn_classifier = tf.estimator.DNNClassifier(hidden_units=[1300,1300,1300], feature_columns=X_train, n_classes=200)
---> 12 nn_classifier.train(input_fn = train_input_fn, steps=2000)
13 total_time = start_time - time.time()
c:\users\appdata\local\programs\python\python36\lib\site-packages\tensorflow\python\estimator\estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
353
354 saving_listeners = _check_listeners_type(saving_listeners)
--> 355 loss = self._train_model(input_fn, hooks, saving_listeners)
356 logging.info('Loss for final step: %s.', loss)
357 return self
c:\users\appdata\local\programs\python\python36\lib\site-packages\tensorflow\python\estimator\estimator.py in _train_model(self, input_fn, hooks, saving_listeners)
822 worker_hooks.extend(input_hooks)
823 estimator_spec = self._call_model_fn(
--> 824 features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
825
826 if self._warm_start_settings:
c:\users\appdata\local\programs\python\python36\lib\site-packages\tensorflow\python\estimator\estimator.py in _call_model_fn(self, features, labels, mode, config)
803
804 logging.info('Calling model_fn.')
--> 805 model_fn_results = self._model_fn(features=features, **kwargs)
806 logging.info('Done calling model_fn.')
807
c:\users\appdata\local\programs\python\python36\lib\site-packages\tensorflow\python\estimator\canned\dnn.py in _model_fn(features, labels, mode, config)
347 head=head,
348 hidden_units=hidden_units,
--> 349 feature_columns=tuple(feature_columns or []),
350 optimizer=optimizer,
351 activation_fn=activation_fn,
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Any clue what I am doing wrong?
The problem is with the feature_columns argument on the estimator. Take a look at the tf.estimator.DNNClassifier documentation:
feature_columns: An iterable containing all the feature columns used by the model. All items in the set should be instances of classes derived from _FeatureColumn.
There is also an example of usage in the docs. Your X_train looks like a set of numeric columns; in this case you can simply create a list like this (note that numeric_column expects a string key):
feature_columns = [tf.feature_column.numeric_column(str(i)) for i in range(...)]
I came across this error today and thought it would be great if I provided a solution.
The problem is brought about by tf.estimator.inputs.pandas_input_fn: according to the TensorFlow docs, x must be a pandas.DataFrame instance and y must be a pandas.Series or a pandas.DataFrame instance. The type() function can help determine the data types of your X_train and y_train values. Changing X_train and y_train to the appropriate data type solves the problem.
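A minimal sketch of that conversion, assuming X_train and y_train are the numpy arrays shown in the question (the generated column names are placeholders of my own):

import pandas as pd
import tensorflow as tf

# give the numeric columns string names so they can double as feature keys
col_names = ['f%d' % i for i in range(X_train.shape[1])]
X_train_df = pd.DataFrame(X_train, columns=col_names)
y_train_s = pd.Series(y_train)

feature_columns = [tf.feature_column.numeric_column(name) for name in col_names]

train_input_fn = tf.estimator.inputs.pandas_input_fn(
    x=X_train_df,
    y=y_train_s,
    batch_size=1000,
    shuffle=True
)

nn_classifier = tf.estimator.DNNClassifier(
    hidden_units=[1300, 1300, 1300],
    feature_columns=feature_columns,
    n_classes=200)
nn_classifier.train(input_fn=train_input_fn, steps=2000)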
