I'm trying to train a neural network using sknn. I have preprocessed my data in a pandas DataFrame. The preprocessing works fine when I call fit(x_train, y_train) on standard sklearn classifiers, but with sknn it throws this attribute error:
anaconda/envs/py3k/lib/python3.4/site-packages/pandas/core/generic.py", line 2360, in __getattr__
(type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute 'todense'
or this error:
/anaconda/envs/py3k/lib/python3.4/site-packages/pandas/core/indexing.py", line 1750, in maybe_convert_indices
raise IndexError("indices are out-of-bounds")
IndexError: indices are out-of-bounds
Seemingly at random (different runs, without changing anything).
The relevant piece of code looks like this:
x_train, x_test, y_train, y_test = cross_validation.train_test_split(X_data, Y_data, test_size=1/kfold)
regr = linear_model.LinearRegression(copy_X=True,fit_intercept=True)
abr = AdaBoostRegressor(base_estimator=tree.DecisionTreeRegressor(max_depth=max_depth_gridsearch_values[max_depth_counter]), n_estimators = n_estimators_gridsearch_values[n_estimators_counter])
nn=nn_simple_regressor
x_train_numeric = x_train.iloc[:,2:]
x_test_numeric = x_test.iloc[:,2:]
regr.fit(x_train_numeric, y_train)
abr.fit(x_train_numeric, y_train)
nn.fit(x_train_numeric,y_train)
And the regressor is defined as
nn_simple_regressor = Regressor(
    layers=[
        Layer("Rectifier", units=100),
        Layer("Linear")],
    learning_rate=0.02,
    n_iter=10)
I cannot understand why this is happening, and it seems like support for sknn is pretty limited. I suspect the issue is actually with the preprocessing, but I don't understand why it works for the first two models but not for my NN. Any ideas?
As of February 2016, sknn does not support pandas DataFrames. To fix the errors described in the question, the best approach is to convert the DataFrame into a numpy array; the pandas .as_matrix() method is the easiest way to do so.
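For illustration, a minimal sketch of that conversion applied to the snippet from the question (x_train, y_train and nn as defined there; on newer pandas versions, .values or .to_numpy() replaces the deprecated .as_matrix()):

import numpy as np

# sknn expects dense numpy arrays rather than pandas objects
x_train_array = x_train.iloc[:, 2:].as_matrix()  # on newer pandas: .values or .to_numpy()
y_train_array = np.asarray(y_train)

nn.fit(x_train_array, y_train_array)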
I'm trying to create a Keras model to train with a group of images, taken from a list of paths.
I know that the method tf.keras.utils.image_dataset_from_directory exists, but it doesn't meet my needs because I want to learn the correct way to handle images and because I need to do regression, not classification.
Every approach I tried failed one way or another, mostly because the type of the x_train variable is wrong.
The most promising function I used to load a single image is:
def encode_image(img_path):
    img = tf.keras.preprocessing.image.load_img(img_path)
    img_array = tf.keras.preprocessing.image.img_to_array(img)
    img_array = tf.expand_dims(img_array, 0)
    return img_array
x_train = df['filename'].apply(lambda i: encode_image(i))
This doesn't work because, when I call the .fit() method this way:
history = model.fit(x_train, y_train, epochs=1)
I receive the following error:
Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray)
This makes me understand that I'm passing the data in the wrong format.
Can someone provide a basic example of creating an (x_train, y_train) pair to feed a model for training using a set of images?
Thank you very much
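One minimal way to build such a pair, sketched here under the assumption that every image is resized to a common shape and that the regression target lives in a hypothetical df['target'] column (df and model as in the question):

import numpy as np
import tensorflow as tf

def encode_image(img_path, size=(224, 224)):
    # load and resize so every image has the same shape, then scale to [0, 1]
    img = tf.keras.preprocessing.image.load_img(img_path, target_size=size)
    return tf.keras.preprocessing.image.img_to_array(img) / 255.0

# stack the per-image arrays into a single (N, H, W, 3) array instead of a Series of arrays
x_train = np.stack([encode_image(p) for p in df['filename']])
y_train = df['target'].to_numpy(dtype='float32')  # hypothetical regression target column

history = model.fit(x_train, y_train, epochs=1)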
I have two CSV files named train.csv and test.csv.
Both files have the same structure, and I want to use train.csv as train data and test.csv as test data.
The thing is, I can't find anywhere how to use scikit-learn linear regression without splitting: every tutorial and piece of documentation I find uses train_test_split(), but if I understand correctly that function is used to split a single file (say, data.csv) into both train and test data.
Is it even possible? If no, what alternative can I use?
If you have separate train and test data, define X_train and y_train as follows.
X_train is the features, excluding the target variable:
# Pseudo-code
X_train = train.drop(target, axis=1)
y_train is the target variable:
# Pseudo-code
y_train = train[target]
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train, y_train)
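To evaluate on the separate test file, the same pattern applies; a sketch, assuming test.csv has the same columns as train.csv (including the target) and that target holds the name of the label column:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

target = 'label'  # placeholder: name of the target column

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

X_train, y_train = train.drop(target, axis=1), train[target]
X_test, y_test = test.drop(target, axis=1), test[target]

reg = LinearRegression().fit(X_train, y_train)
print(r2_score(y_test, reg.predict(X_test)))  # evaluate on the held-out test file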
I am relatively new to TensorFlow and have been putting together some model training based on a tutorial I found on the TensorFlow website. I have been able to put together something functional that satisfies my preliminary requirements.
I am reading a local CSV file that provides paths to images, each associated with labels written on the same CSV row. My code roughly looks like this:
def map_func(*row):
    # the image path is assumed to be in the first CSV column
    img = process_img(row[0])
    output = read(row)
    return img, output
dataset = tf.data.experimental.CsvDataset(CSV_FILE, default_format, header=True)
dataset = dataset.map(map_func)
dataset = dataset.shuffle(buffer_size=shuffle_buffer_size)
dataset = dataset.batch(NB_IMG)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
X, y = next(iter(dataset))
X_train, X_test = tf.split(X, split, axis=0)
y_train, y_test = tf.split(y, split, axis=0)
model = create_model()
model.compile(optimizer=OPTIMIZER, loss='mse')
model.fit(x=X_train, y=y_train, epochs=EPOCHS, validation_data=(X_test, y_test))
NB_IMG is the total number of images I have. EPOCHS is arbitrarily fixed to a given value (generally 20 or 40), and split is a ratio applied to NB_IMG.
All my images are local on my computer, and with that code my GPU can currently manage roughly up to 50000 images. The training fails with more images (GPU exhausted). I understand that this is because I am reading all the data at once, but I am a bit stuck on the next step needed to handle a bigger dataset.
This part below is the one that needs improvement, I guess:
X, y = next(iter(dataset))
Could someone here help me move forward and point me towards some examples or snippets where I can train the model on a bigger dataset? I am a bit lost on the next move and not sure where to focus in the TensorFlow documentation. I did not really find a clear example online that would suit my needs. How should I loop over the different batches? How should the iterator be coded?
Thanks!
Well, can you give more details about the two functions process_img and read?
During my experiments, I have noticed that the shuffle function can be slow when you have a lot of data and the buffer size is big. Try commenting out that line and check whether it runs faster. If so, you can use pandas to load your CSV file, shuffle it there, and use tf.data.Dataset.from_tensor_slices (a sketch of that approach is given below).
TensorFlow now also has a great tool to profile models and the dataset pipeline (https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras).
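For illustration, a minimal sketch of the pandas + from_tensor_slices suggestion, reusing the process_img function from the question and assuming placeholder column names 'filename' and 'label':

import pandas as pd
import tensorflow as tf

df = pd.read_csv(CSV_FILE).sample(frac=1)  # shuffle once in pandas instead of in the tf.data pipeline

dataset = tf.data.Dataset.from_tensor_slices((df['filename'].values, df['label'].values))
dataset = dataset.map(lambda filename, label: (process_img(filename), label))
dataset = dataset.batch(32).prefetch(tf.data.experimental.AUTOTUNE)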
process_img and read are very simple functions:
def process_img(filename):
    img = tf.io.read_file(filename)
    return tf.image.decode_jpeg(img, channels=3)

def read(row):
    return row[1]
The shuffle part of my code is slow, but it does not seem to be the cause of the failure; I can remove it and shuffle the data directly in the CSV. It seems to fail at the X, y = next(iter(dataset)) line when the dataset is too big.
Thanks for the suggestion to profile the code, I will give it a go. Is there any other possible approach to split and iterate over the dataset?
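One way to avoid materializing everything with X, y = next(iter(dataset)) is to keep the data as a streaming tf.data pipeline and pass it to fit directly, which iterates batch by batch. A sketch reusing the names from the question (CSV_FILE, default_format, map_func, create_model, OPTIMIZER, EPOCHS), with split treated here as a train fraction:

import tensorflow as tf

BATCH_SIZE = 32
n_train = int(NB_IMG * split)  # assumption: 'split' is the fraction of images used for training

dataset = tf.data.experimental.CsvDataset(CSV_FILE, default_format, header=True)
dataset = dataset.map(map_func, num_parallel_calls=tf.data.experimental.AUTOTUNE)

train_ds = dataset.take(n_train).batch(BATCH_SIZE).prefetch(tf.data.experimental.AUTOTUNE)
val_ds = dataset.skip(n_train).batch(BATCH_SIZE).prefetch(tf.data.experimental.AUTOTUNE)

model = create_model()
model.compile(optimizer=OPTIMIZER, loss='mse')
# fit consumes the dataset batch by batch, so the full image set never has to sit in GPU memory at once
model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)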
I created a VotingClassifier() object using sklearn and later saved it to a voting_predictor.pkl file using joblib. While it loads successfully, when I try to predict some data with voting_predictor.predict(X_test) I get the following error:
TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe'
I tried to dump/load the object with pickle and I got the same exact error. The code looks like this:
eclf1 = VotingClassifier(estimators=estimators, voting='hard')
eclf1 = eclf1.fit(X_train, y_train)
y_pred = eclf1.predict(X_test)
report = classification_report(y_test, y_pred)
poll_accuracy = accuracy_score(y_test, y_pred)
print(report)
print(poll_accuracy)
# successful object dump
filename = 'voting_predictor.pkl'
joblib.dump(eclf1, filename)
#successful object load
voting_predictor = joblib.load(filename)
# this prints the object correctly, showing all its parameters
print(voting_predictor)
#error shows here
y_pred = voting_predictor.predict(X_test)
report = classification_report(y_test, y_pred)
poll_accuracy = accuracy_score(y_test, y_pred)
The print(voting_predictor) call successfully prints the object and all its parameters. Any ideas about why this is happening?
I got the same error while ensembling CatBoost with other predictors.
I found this solution, however I am looking for a more elegant one.
The problem was that the target column contained the class names as strings. It seems that leaving the string values without label-encoding them to integers was causing this error. In every other case, however, sklearn handled the string class names correctly, producing all metrics such as classification_report and accuracy_score without errors. The error occurred only when I loaded the object from the file.
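A minimal sketch of that workaround with LabelEncoder, using the names from the question (estimators, X_train, y_train, X_test) and persisting the encoder next to the model so predictions can be mapped back to the original class names:

import joblib
from sklearn.ensemble import VotingClassifier
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)  # encode the string class names to integers

eclf1 = VotingClassifier(estimators=estimators, voting='hard').fit(X_train, y_train_enc)
joblib.dump((eclf1, le), 'voting_predictor.pkl')

voting_predictor, le_loaded = joblib.load('voting_predictor.pkl')
y_pred = le_loaded.inverse_transform(voting_predictor.predict(X_test))  # back to string labels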
I have two data sets, training and test. They have labels "1" and "0". I need to evaluate these data sets using the OneClassSVM algorithm with an "rbf" kernel in scikit-learn. I loaded the training data set, but I have no idea how to evaluate it against the test data set. Below is my code:
from sklearn import svm
import numpy as np
input_file_data = "/home/anuradha/TrainData.csv"
dataset = np.loadtxt(input_file_data, delimiter=",")
X = dataset[:,0:4]
y = dataset[:,4]
estimator= svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
Can someone please help me solve this problem?
It's as simple as adding the following two lines of code at the end of your script:
estimator.fit(X_train)
y_pred_test = estimator.predict(X_test)
The first line tells the SVM which training data to use and the second one makes predictions on the test set (be sure to load both datasets and to change the variable names accordingly).
Here is a complete example of how to use OneClassSVM, and here is the class reference.
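Putting the answer together for the two files from the question, a minimal sketch (TestData.csv is an assumed filename, and since OneClassSVM predicts +1 for inliers and -1 for outliers, labels stored as 1/0 are remapped before scoring):

import numpy as np
from sklearn import svm

train = np.loadtxt("/home/anuradha/TrainData.csv", delimiter=",")
test = np.loadtxt("/home/anuradha/TestData.csv", delimiter=",")  # assumed path

X_train, X_test = train[:, 0:4], test[:, 0:4]
y_test = np.where(test[:, 4] == 1, 1, -1)  # map 0 -> -1 to match OneClassSVM's output

estimator = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
estimator.fit(X_train)
y_pred_test = estimator.predict(X_test)
print("accuracy:", np.mean(y_pred_test == y_test))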