Pandas DataFrame and Keras - Python

I'm trying to perform a sentiment analysis in Python using Keras. To do so, I need to do a word embedding of my texts. The problem appears when I try to fit the data to my model:
model_1 = Sequential()
model_1.add(Embedding(1000,32, input_length = X_train.shape[0]))
model_1.add(Flatten())
model_1.add(Dense(250, activation='relu'))
model_1.add(Dense(1, activation='sigmoid'))
model_1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
The shape of my train data is
(4834,)
and it is a Pandas Series object. When I try to fit my model and validate it with some other data, I get this error:
model_1.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=64, verbose=2)
ValueError: Error when checking model input: expected
embedding_1_input to have shape (None, 4834) but got array with shape
(4834, 1)
How can I reshape my data to make it suitable for Keras? I've been trying np.reshape, but I cannot place None elements with that function.
Thanks in advance

None is the number of rows that will go into training, so you can't define it yourself. Also, Keras needs a NumPy array as input, not a pandas DataFrame. First convert the df to a NumPy array with df.values and then reshape it with np.reshape(arr, (-1, 4834)). Note that you should cast to np.float32; this is important if you train on a GPU.
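A minimal sketch of that conversion, reusing X_train from the question:
import numpy as np

X_train_arr = X_train.values                       # pandas Series -> 1-D NumPy array
X_train_arr = np.reshape(X_train_arr, (-1, 4834))  # -1 lets NumPy infer the row count
# For numeric data, cast before training on GPU:
# X_train_arr = X_train_arr.astype(np.float32)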

https://pypi.org/project/keras-pandas/
The easiest way is to use the keras_pandas package to fit a pandas DataFrame to Keras. The code shown below is a general example from the package docs.
from keras import Model
from keras.layers import Dense
from keras_pandas.Automater import Automater
from keras_pandas.lib import load_titanic
observations = load_titanic()
# Transform the data set, using keras_pandas
categorical_vars = ['pclass', 'sex', 'survived']
numerical_vars = ['age', 'siblings_spouses_aboard', 'parents_children_aboard', 'fare']
text_vars = ['name']
auto = Automater(categorical_vars=categorical_vars, numerical_vars=numerical_vars,
                 text_vars=text_vars, response_var='survived')
X, y = auto.fit_transform(observations)
# Start model with provided input nub
x = auto.input_nub
# Fill in your own hidden layers
x = Dense(32)(x)
x = Dense(32, activation='relu')(x)
x = Dense(32)(x)
# End model with provided output nub
x = auto.output_nub(x)
model = Model(inputs=auto.input_layers, outputs=x)
model.compile(optimizer='Adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train model
model.fit(X, y, epochs=4, validation_split=.2)

You need a specific version of Pandas for this to work. If you use the current version (as of 20th Aug 2018), this will fail.
Roll back your Pandas and Keras (pip uninstall ....) and then install a specific version like this:
python -m pip install pandas==0.19.2

Use tf.data.Dataset.from_tensor_slices to read the values from a pandas DataFrame.
See https://www.tensorflow.org/tutorials/load_data/pandas_dataframe for a reference on how to do this properly in TF 2.x.
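A minimal sketch of that approach (the DataFrame and column names here are made up for illustration):
import pandas as pd
import tensorflow as tf

df = pd.DataFrame({'feature': [0.1, 0.2, 0.3], 'target': [0, 1, 1]})
# Slice the frame into (features, labels) pairs, then batch
dataset = tf.data.Dataset.from_tensor_slices((df['feature'].values, df['target'].values))
dataset = dataset.batch(2)
# model.fit(dataset, epochs=5) can then consume the dataset directly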

Related

ValueError: Target data is missing. Your model was compiled with loss=<keras.losses.MeanSquaredError ... in regression task

I am pretty new to neural networks/Keras, so I am probably missing something obvious. The goal is to feed a time series in and then predict the next value.
The dataset construction is the following:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_probability as tfp
import numpy as np
train_dataset, test_dataset = test_train_split(data['ma'], 20)
train_dataset = np.array(train_dataset)
train_dataset = [np.reshape(train_dataset, (train_dataset.shape[0], 1))]
test_dataset = np.array(test_dataset)
test_dataset = [np.reshape(test_dataset, (test_dataset.shape[0], 1))]
The test_train_split(data['ma'], 20) function returns 2 lists of float values, so in the end train_dataset is a list containing a np.array of shape (380, 1) and test_dataset one of shape (20, 1).
Now I defined a Bayesian neural network:
hidden_units = [8, 8]  # list length = number of layers; each entry = units per layer
train_size = train_dataset[0].shape[0]
inputs = layers.Input(shape=(1,))
features = inputs
for units in hidden_units:
    features = tfp.layers.DenseVariational(
        units=units,
        make_prior_fn=prior,
        make_posterior_fn=posterior,
        kl_weight=1 / train_size,
        activation="sigmoid",
    )(features)
outputs = layers.Dense(units=1)(features)
bnn_model = keras.Model(inputs=inputs, outputs=outputs)
Now I compile:
mse_loss = keras.losses.MeanSquaredError()
bnn_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss=mse_loss,
    metrics=[keras.metrics.RootMeanSquaredError()],
)
And finally I try to use the model.fit() function like this:
bnn_model.fit(train_dataset, validation_data=test_dataset, epochs = 100)
But this gives me the following error: ValueError: Target data is missing. Your model was compiled with loss=<keras.losses.MeanSquaredError object at 0x0000020A67D56610>, and therefore expects target data to be provided in `fit()`.
I tried searching online but it still doesn't make sense to me. I suspect it has something to do with how I construct the datasets but I am not sure how. Let me know if I need to provide further information and thanks for the help!
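For reference, fit() expects targets alongside the inputs; a minimal sketch for next-value prediction, reusing the names from the question (the one-step shift is an assumption about the intended setup, not code from the thread):
x_train = train_dataset[0][:-1]  # values at time t, shape (379, 1)
y_train = train_dataset[0][1:]   # values at time t+1, used as targets
x_val = test_dataset[0][:-1]
y_val = test_dataset[0][1:]
bnn_model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=100)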

Keras Sequential: advantage of creating Tensorflow numeric_columns?

I am learning about creating neural networks using Keras and running through various tutorials. In one, the model is built by creating a series of tf.feature_column.numeric_column objects and passing them to the Keras Sequential model (in this example, feat_cols is the list of feature columns):
feature_columns = {c: tf.feature_column.numeric_column(c) for c in feat_cols}
model = Sequential([DenseFeatures(feature_columns=feature_columns.values()),
                    Dense(units=32, activation='relu'),
                    Dense(units=8, activation='relu'),
                    Dense(units=1, activation='linear')])
In another tutorial, the initial input is taken directly from a pandas DataFrame converted into a NumPy array using .values. The dictionary of tensors is never created, and the first layer doesn't have the DenseFeatures part. (In this case df is the DataFrame, features is a list of feature columns, and lbl is the target column.)
x = df[features].values
y = df[lbl].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.05)
model = Sequential([
    Dense(10, input_shape=(6,), activation='relu'),
    Dense(20, activation='relu'),
    Dense(5, activation='relu'),
    Dense(1)])
In this case when model.fit is called, just x_train and y_train are passed instead of the tensor dict in the first example.
My question is what is the advantage or disadvantage (if any) of these two approaches? Are they two ways of getting to the same place or is there an actual difference?
Note that the two sequential nets are definitely not equivalent. But if you consider only the input components, they are essentially the same: both are valid ways to pass your data into the net. In practice, however, DataFrames are more common data sources, and tensors are slightly easier to handle with TensorFlow; with the Keras API there should be no performance difference. See https://www.tensorflow.org/tutorials/load_data/pandas_dataframe
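As a sketch of the difference in how the two models are fed (variable names assumed from the examples above):
x_dict = {c: df[c].values for c in feat_cols}  # dict of per-column arrays for the DenseFeatures model
x_arr = df[features].values                    # single 2-D array for the plain Dense model
# feature_column_model.fit(x_dict, y, epochs=5)
# plain_model.fit(x_arr, y, epochs=5)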

ValueError: Layer sequential_20 expects 1 inputs, but it received 2 input tensors

I am trying to build a simple Autoencoder using the KMNIST dataset from Tensorflow and some sample code from a textbook I'm using, but I keep getting an error when I try to fit the model.
The error says ValueError: Layer sequential_20 expects 1 inputs, but it received 2 input tensors.
I'm really new to TensorFlow, and all my research on this error has baffled me since it seems to involve things not in my code.
This thread wasn't helpful since I'm only using sequential layers.
Code in full:
import numpy as np
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
import pandas as pd
import matplotlib.pyplot as plt
#data = tfds.load(name = 'kmnist')
(img_train, label_train), (img_test, label_test) = tfds.as_numpy(tfds.load(
    name='kmnist',
    split=['train', 'test'],
    batch_size=-1,
    as_supervised=True,
))
img_train = img_train.squeeze()
img_test = img_test.squeeze()
## From Hands on Machine Learning Textbook, chapter 17
stacked_encoder = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(100, activation="selu"),
    keras.layers.Dense(30, activation="selu"),
])
stacked_decoder = keras.models.Sequential([
    keras.layers.Dense(100, activation="selu", input_shape=[30]),
    keras.layers.Dense(28 * 28, activation="sigmoid"),
    keras.layers.Reshape([28, 28])
])
stacked_ae = keras.models.Sequential([stacked_encoder, stacked_decoder])
stacked_ae.compile(loss="binary_crossentropy",
                   optimizer=keras.optimizers.SGD(lr=1.5))
history = stacked_ae.fit(img_train, img_train, epochs=10,
                         validation_data=[img_test, img_test])
It helped me when I changed validation_data=[X_val, y_val] into validation_data=(X_val, y_val).
I still wonder why, actually.
Use validation_data=(img_test, img_test) instead of validation_data=[img_test, img_test].
Here is the example with the encoder and decoder combined together:
stacked_ae = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(100, activation="selu"),
    keras.layers.Dense(30, activation="selu"),
    keras.layers.Dense(100, activation="selu"),
    keras.layers.Dense(28 * 28, activation="sigmoid"),
    keras.layers.Reshape([28, 28])
])
stacked_ae.compile(loss="binary_crossentropy",
                   optimizer=keras.optimizers.SGD(lr=1.5))
history = stacked_ae.fit(img_train, img_train, epochs=10,
                         validation_data=(img_test, img_test))
As stated in the Keras API reference (link),
validation_data: ... validation_data could be:
- tuple (x_val, y_val) of Numpy arrays or tensors
- tuple (x_val, y_val, val_sample_weights) of Numpy arrays
- dataset ...
So, validation_data has to be a tuple rather than a list (of Numpy arrays or tensors); we should use parentheses (round brackets) (...), not square brackets [...].
In my limited experience, however, TensorFlow 2.0.0 is indifferent to the use of square brackets, while TensorFlow 2.3.0 complains about it. Your script would be fine if it were run under TF 2.0 instead of TF 2.3.
You have given the data instead of the labels two times:
history = stacked_ae.fit(img_train, img_train, epochs=10,
                         validation_data=[img_test, img_test])
instead of
history = stacked_ae.fit(img_train, label_train, epochs=10,
                         validation_data=[img_test, label_test])
In other solutions, some say you need to change from brackets to parentheses, but that was not working in Colab. And yes, turning validation_data=[X_val, y_val] into validation_data=(X_val, y_val) should work, since that is the required format, but with tf==2.5.0 (in Google Colab) it doesn't solve the problem. I changed from the functional API to the sequential API, and that solved the problem. Strange.
This error may also be triggered by submitting the wrong object to model.fit(). It happened to me when I mistakenly tried to execute
model.fit(images)
when I wanted to execute
model.fit(dataset)
with
dataset = tf.data.Dataset.from_tensor_slices((images, images))

What values are returned from model.evaluate() in Keras?

I've got multiple outputs from my model from multiple Dense layers. My model has 'accuracy' as the only metric in compilation. I'd like to know the loss and accuracy for each output. This is some part of my code.
scores = model.evaluate(X_test, [y_test_one, y_test_two], verbose=1)
When I printed out the scores, this is the result.
[0.7185557290413819, 0.3189622712272771, 0.39959345855771927, 0.8470299135229717, 0.8016634374641469]
What do these numbers represent?
I'm new to Keras and this might be a trivial question. However, I have read the Keras docs and I'm still not sure.
Quoted from the evaluate() method documentation:
Returns
Scalar test loss (if the model has a single output and no metrics) or
list of scalars (if the model has multiple outputs and/or metrics).
The attribute model.metrics_names will give you the display labels
for the scalar outputs.
Therefore, you can use the metrics_names property of your model to find out what each of those values corresponds to. For example:
from keras import layers
from keras import models
import numpy as np
input_data = layers.Input(shape=(100,))
out_1 = layers.Dense(1)(input_data)
out_2 = layers.Dense(1)(input_data)
model = models.Model(input_data, [out_1, out_2])
model.compile(loss='mse', optimizer='adam', metrics=['mae'])
print(model.metrics_names)
outputs the following:
['loss', 'dense_1_loss', 'dense_2_loss', 'dense_1_mean_absolute_error', 'dense_2_mean_absolute_error']
which indicates what each of those numbers you see in the output of evaluate method corresponds to.
Further, if you have many layers, those dense_1 and dense_2 names might be a bit ambiguous. To resolve this ambiguity, you can assign names to your layers using the name argument (not necessarily on all of them, but at least on the input and output layers):
# ...
out_1 = layers.Dense(1, name='output_1')(input_data)
out_2 = layers.Dense(1, name='output_2')(input_data)
# ...
print(model.metrics_names)
which outputs a clearer description:
['loss', 'output_1_loss', 'output_2_loss', 'output_1_mean_absolute_error', 'output_2_mean_absolute_error']
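Since the order of model.metrics_names matches the order of the list returned by evaluate(), you can pair them up directly; a small sketch using the names from the question:
scores = model.evaluate(X_test, [y_test_one, y_test_two], verbose=1)
for name, value in zip(model.metrics_names, scores):
    print(name, value)  # e.g. 'loss' followed by its scalar value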
We should be clear that the "loss" figure is the sum of ALL the losses calculated over the items in the x_test array, not the loss from any single item; x_test contains your test data and y_test your labels.

Keras model taking forever to train with Dask DataFrame

I'm working with a large dataset on a machine with low memory, and I was introduced to Dask DataFrames. What I understood from the docs is that Dask does not load the whole dataset into memory; instead it creates multiple threads that fetch records from disk on demand. So I supposed that a Keras model with batch size = 500 would only have 500 records in memory at training time, but when I start training it takes forever. Maybe I am doing something wrong; please suggest.
shape of training data: 1000000 * 1290
import glob
import dask.dataframe as dd
paths_train = glob.glob(r'x_train_d_final*.csv')
X_train_d = dd.read_csv('.../x_train_d_final0.csv')
Y_train1 = keras.utils.to_categorical(Y_train.iloc[:, 1], num_classes)
batch_size = 500
num_classes = 2
epochs = 5
model = Sequential()
model.add(Dense(645, activation='sigmoid', input_shape=(1290,),kernel_initializer='glorot_normal'))
#model.add(Dense(20, activation='sigmoid',kernel_initializer='glorot_normal'))
model.add(Dense(num_classes, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer=Adam(decay=0),
              metrics=['accuracy'])
history = model.fit(X_train_d.to_records(), Y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    class_weight={0: 1, 1: 6.5},
                    shuffle=False)
You should use fit_generator() on the Sequential model with a generator or with a Sequence instance. Both provide a proper way to load only a portion of the data at a time.
The Keras docs provide an excellent example:
def generate_arrays_from_file(path):
    while 1:
        f = open(path)
        for line in f:
            # create Numpy arrays of input data
            # and labels, from each line in the file
            x, y = process_line(line)
            yield (x, y)
        f.close()

model.fit_generator(generate_arrays_from_file('/my_file.txt'),
                    steps_per_epoch=1000, epochs=10)
Today Keras does not know about Dask DataFrames or arrays. I suspect that it is just converting the Dask object into the equivalent pandas or NumPy object instead.
If your Keras model can be trained incrementally, then you could solve this problem using dask.delayed and some for loops, as sketched below.
Eventually it would be nice to see the Keras and Dask projects learn more about each other to facilitate these workloads without extra work.
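A rough sketch of that incremental idea, assuming the model can be updated batch by batch (the glob pattern is taken from the question; the label handling is a placeholder, not tested code):
import dask.dataframe as dd
import numpy as np

ddf = dd.read_csv('x_train_d_final*.csv')
for epoch in range(5):
    for part in ddf.to_delayed():   # one lazy chunk per partition
        chunk = part.compute()      # materialize only this partition in memory
        x = chunk.values.astype(np.float32)
        # y_chunk would come from a matching slice of the labels
        # model.train_on_batch(x, y_chunk)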
