I have a function which takes 2 arrays, an array of model predictions and an array of true values. It works fine when dealing with 1-d arrays, but I need to adjust it for multidimensional arrays. I would like to compute my threshold_acc again, but this time for each column. How do I go about this?
threshold_acc represents the proportion of the set with error below the specified threshold.
Also, do I need to change my threshold to +/- since I've started seeing negative values in the multidimensional array, or is there a better error measure I could use?
import numpy as np
import pandas as pd

# sample data
np.random.seed(20)
dd = np.random.uniform(low=-20., high=20, size=(25, 4))
dp = np.random.uniform(low=5, high=25, size=(25, 4))
data = [dd, dp]

def inference(dummy_data, error_threshold=10):
    rel_err_list = []
    AE_error_list = []
    mse_list = []
    input_var = []
    true_var = []
    pred_var = []
    n_correct = 0; n_wrong = 0; n_inf = 0
    # Iterate through data loader and inference and evaluate data
    targets, outputs = data[0], data[1]
    for idx, (outputs, targets) in enumerate(zip(outputs, targets)):
        rel_error = np.abs(outputs - targets)/targets
        rel_error = rel_error * 100
        AE_error = np.abs(outputs - targets)
        if np.isfinite(rel_error).all():
            rel_err_list.append(rel_error)
            AE_error_list.append(AE_error)
            # Negative errors
            print(f"error: {rel_error} output: {outputs} target: {targets}")
        else:
            n_inf += 1
        if rel_error.all() < error_threshold:
            n_correct += 1
        else:
            n_wrong += 1
        true_var.append(targets)
        pred_var.append(outputs)
    median_err, max_err, min_err = np.median(rel_err_list), np.max(rel_err_list), np.min(rel_err_list)
    threshold_acc = ((n_correct * 1.0) / 25) * 100
    true_var = np.array(true_var)
    pred_var = np.array(pred_var)
    err_var = np.array(rel_err_list)
    AE_var = np.array(AE_error_list)
    true_var = np.reshape(true_var, dummy_data[0].shape)
    pred_var = np.reshape(pred_var, dummy_data[0].shape)
    err_var = np.reshape(err_var, dummy_data[0].shape)
    AE_var = np.reshape(AE_var, dummy_data[0].shape)
    results = np.concatenate([true_var, pred_var, err_var, AE_var], axis=1)
    results_df = pd.DataFrame(results)
    return median_err, max_err, min_err, threshold_acc, n_inf, n_wrong, results_df, pred_var
dd = np.random.uniform(low=1., high=20, size=(25, 1))
dp = np.random.uniform(low=5, high=25, size=(25, 1))
median_err, max_err, min_err, threshold_acc, n_inf, n_wrong, results_df, pred_var = inference(data, 10)
print(f"\nAverage relative error over valid predictions : {median_err:.3f} \nMax error over valid predictions : {max_err:.3f} \nMin error over valid predictions : {min_err:.3f}\nProportion of test set with accuracy over 90%: {threshold_acc:.3f}\n\n\
{n_inf} null predictions \n{n_wrong} incorrect (<90%) predictions \n{n_inf+n_wrong} null or incorrect predictions out of 25")
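Here is a minimal sketch of one way to do the per-column version. It rests on assumptions about the intended behaviour: it treats outputs and targets as (n_samples, n_columns) arrays, uses the absolute value of the targets in the denominator so that negative targets don't flip the sign of the relative error, and counts, per column, the proportion of finite errors below the threshold:

import numpy as np

def column_threshold_acc(outputs, targets, error_threshold=10):
    # percentage relative error, element-wise; abs() in the denominator keeps it sign-safe
    rel_error = np.abs(outputs - targets) / np.abs(targets) * 100
    valid = np.isfinite(rel_error)                    # drop divisions by zero / inf
    below = (rel_error < error_threshold) & valid
    # proportion of valid samples below the threshold, computed per column
    return 100.0 * below.sum(axis=0) / valid.sum(axis=0)

print(column_threshold_acc(dp, dd, error_threshold=10))

Whether the absolute-value denominator matches your intended definition of threshold_acc is an assumption on my part; switching to the plain absolute error (your AE_error) would sidestep the negative-value question entirely.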
Related
I am trying to shuffle my data, which I have called new_array, but when I do, it returns an X_shuffled array of the same size but filled with zeros. I have no idea why, as new_array has all the values there, so X_shuffled should be an array with the signals shuffled.
I went back to basics, just built two arrays and used shuffle, and this worked fine, as you can see.
X=np.array([[1.1,2.2,3.3,4.4],[1.2,2.3,3.4,4.5],[2.1,2.2,2.3,2.4],[3.1,3.2,3.3,3.4],[4.1,4.2,4.3,4.4]])
print(X.shape)
y=np.array([0.1,0.2,0.2,3,4])
X_shuffle, Y_shuffle = shuffle(X,y)
My problematic code is attached below.
import numpy as np
from sklearn import preprocessing
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

def frequency_labels(s_frequency):
    L = []
    for w, f2 in enumerate(s_frequency):
        l = " {} Hz".format(f2)
        print("f1=", f2)
        L.append(l)
    return L

def time_labels(time):
    H = []
    for r, t in enumerate(time):
        h = " {} s".format(t)
        H.append(h)
    return H

def gaussian_noise(increment, len_time):
    mean = 0
    standard_deviation = np.arange(0.5, 2.2, increment)
    ## want 8096 different noise signals of different standard deviations
    sd = standard_deviation.reshape(len(standard_deviation), 1)
    noise = np.empty((len(sd), len_time, 1), dtype=np.float16)
    for t, value in enumerate(sd):
        noise[t] = np.random.normal(mean, value, len_time).reshape(len_time, 1)
    return noise

max_freq = 50
s_frequency = np.arange(0, 60, 0.1)  # range of frequencies
fs = 200
time = np.arange(0, 5-(1/fs), (1/fs))
amplitude = np.empty((len(time)), dtype=np.float16)
len_time = len(time)
len_frequency = len(s_frequency)
array = np.empty((len(time)), dtype=np.float16)
increment = 0.1  # 0.00021
L = frequency_labels(s_frequency)
H = time_labels(time)
k = 0
noise = gaussian_noise(increment, len_time)
new_array = np.empty((len(s_frequency)*(len(noise)), len(time)), dtype=np.float16)
training_labels = []
for f1 in s_frequency:
    # amplitude of signal and adding the noise onto this
    amplitude = np.sin(2*np.pi*f1*time).reshape(len(time), 1)
    amplitude = np.add(noise, amplitude).reshape(len(noise), len(time))
    # Normalizing between -1 and 1
    for r in range(17):
        average = float(min(amplitude[r, :]) + max(amplitude[r, :]))/2
        rangev = float(max(amplitude[r, :]) - min(amplitude[r, :]))/2
        new_array[k] = (amplitude[r, :] - average)/rangev
    for q in range(17):
        training_labels.append(f1)
        k = k + 1

training_labels = np.array(training_labels).reshape((len(s_frequency)*len(noise)),)

# shuffle the data
X_shuffled, Y_shuffled = shuffle(new_array.astype('float64'), training_labels, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X_shuffled, Y_shuffled, test_size=0.2)
Any ideas why this is happening?
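As a quick diagnostic (a sketch only, assuming new_array and X_shuffled exactly as built above): sklearn's shuffle just permutes rows, so if X_shuffled comes out as zeros it is worth checking whether new_array already contains them before the shuffle:

print(new_array.dtype, new_array.shape)
print(np.count_nonzero(new_array), "non-zero entries out of", new_array.size)
print(np.count_nonzero(X_shuffled), "non-zero entries after shuffling")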
I found this code snippet to basically oversample my negative reviews to better train my model. When I went to run it, I got the error below. It looks to be around idx. Does anyone have a good solution for this?
Passing list-likes to .loc or [] with any missing labels is no longer supported
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
from sklearn.utils import shuffle
import numpy as np

labels, num = np.unique(y_train, return_counts=True)
#print(labels)
u = min(labels)
initial = 1
# set the desired size of the oversampled cells
maxcnt = int(max(num)/2)
for labl, n in zip(labels, num):
    x0 = X_train[y_train == labl]
    y0 = y_train[y_train == labl]
    # print(x0)
    remain = maxcnt
    print(remain)
    while remain >= n:
        if labl == u and initial == 1:
            X_Train = x0
            y_Train = y0
            remain -= n
            initial = 0
        else:
            X_Train = np.concatenate((X_Train, x0), axis=0)
            y_Train = np.concatenate((y_Train, y0), axis=0)
            remain -= n
    if remain > 0 and remain < n:
        idx = np.random.choice(np.arange(len(y0)), remain, replace=False)
        #print(idx)
        X_Train = np.concatenate((X_Train, x0[idx]), axis=0)
        y_Train = np.concatenate((y_Train, y0[idx]), axis=0)
        remain -= n

X_Train, y_Train = shuffle(X_Train, y_Train)
np.unique(X_Train, return_counts=True)
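For what it's worth, a simpler route to the same oversampling (a sketch only, assuming X_train is a numpy array or DataFrame and y_train a matching 1-d array; the names follow the snippet above) is sklearn.utils.resample, which avoids the manual index bookkeeping that can trigger the .loc error:

import numpy as np
from sklearn.utils import resample, shuffle

labels, counts = np.unique(y_train, return_counts=True)
target_size = int(max(counts) / 2)            # same target size as maxcnt above

X_parts, y_parts = [], []
for labl in labels:
    x0, y0 = X_train[y_train == labl], y_train[y_train == labl]
    if len(y0) >= target_size:
        X_parts.append(x0); y_parts.append(y0)
    else:
        # sample with replacement up to the target size for minority classes
        xr, yr = resample(x0, y0, replace=True, n_samples=target_size, random_state=0)
        X_parts.append(xr); y_parts.append(yr)

X_Train = np.concatenate(X_parts, axis=0)
y_Train = np.concatenate(y_parts, axis=0)
X_Train, y_Train = shuffle(X_Train, y_Train)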
I'm trying to create a network that would help predict stock prices for the following day. My input data are: open, high, low and close stock values, volume, index values, a few technical indicators and the exchange rate; the output is the closing price from the next day. I'm using data loaded from an Excel file.
I wrote a program, which I will paste below, but it doesn't seem to be working correctly. The network always returns 1, 0 or some other constant value (between 0 and 1).
I took the following steps so far:
tried to normalise the data like so: X_norm = X/(10 ** d), where d is the smallest number for which this condition is met: abs(X_norm) < 1. I did that for the whole set in Excel before dividing it into training and test (a small sketch of this step is shown just after these points).
shuffled the data before dividing it into training/test, so that learning examples are not from consecutive days
ran the network on a smaller data set and on an example data set (I generated random numbers, did some simple maths on them for an output, and tried running the network with that)
changed the number of hidden neurons
changed the number of iterations (up to 1000, which was a lot for my computer considering the data set, so I didn't try any more because it would take too much time)
changed the learning rate.
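A sketch of the power-of-ten normalisation described in the first point above (this is my reading of the condition, with a dummy X purely for illustration):

import numpy as np

X = np.array([[523.4, 0.87, 1200.0], [499.1, 0.91, 1150.0]])   # dummy feature matrix
# smallest integer d such that abs(X / 10**d) < 1 holds for every element
d = int(np.floor(np.log10(np.abs(X).max()))) + 1
X_norm = X / (10 ** d)
print(d, np.abs(X_norm).max())   # here d = 4, and every value is now below 1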
No matter what steps I took the outcome was always the same. I think my problem could be that I don't have a bias, but perhaps I also have other mistakes in my code that are contributing to this error.
My program:
import numpy as np
import pandas as pd

df = pd.read_excel(r"path", sheet_name="DATA", index_col=0, header=0)
df = df.to_numpy()
np.random.shuffle(df)
X_data = df[:, 0:15]
X_data = X_data.reshape(1000, 1, 15)
print(f"X_data: {X_data}")
Y_data = df[:, 15]
Y_data = Y_data.reshape(1000, 1, 1)
print(f"Y_data: {Y_data}")
X = X_data[0:801]
x_test = X_data[801:]
y = Y_data[0:801]
y_test = Y_data[801:]
print(f"X_train: {X}")
print(f"x_test: {x_test}")
print(f"Y_train: {y}")
print(f"y_test: {y_test}")

rate = 0.2

class NeuralNetwork:
    def __init__(self):
        self.input_neurons = 15
        self.hidden1_neurons = 10
        self.hidden2_neurons = 5
        self.output_neuron = 1
        self.input_to_hidden1_w = np.random.random((self.input_neurons, self.hidden1_neurons))      # 15x10
        self.hidden1_to_hidden2_w = np.random.random((self.hidden1_neurons, self.hidden2_neurons))  # 10x5
        self.hidden2_to_output_w = np.random.random((self.hidden2_neurons, self.output_neuron))     # 5x1

    def activation(self, x):
        sigmoid = 1/(1+np.exp(-x))
        return sigmoid

    def activation_d(self, x):
        derivative = x * (1 - x)
        return derivative

    def feed_forward(self, X):
        self.z1 = np.dot(X, self.input_to_hidden1_w)
        self.z1_a = self.activation(self.z1)
        self.z2 = np.dot(self.z1_a, self.hidden1_to_hidden2_w)
        self.z2_a = self.activation(self.z2)
        self.z3 = np.dot(self.z2_a, self.hidden2_to_output_w)
        output = self.activation(self.z3)
        return output

    def backward(self, X, y, rate, output):
        error = y - output
        z3_error_delta = error * self.activation_d(output)
        z2_error = np.dot(z3_error_delta, np.transpose(self.hidden2_to_output_w))
        z2_error_delta = z2_error * self.activation_d(self.z2)
        z1_error = np.dot(z2_error_delta, np.transpose(self.hidden1_to_hidden2_w))
        z1_error_delta = z1_error * self.activation_d(self.z1)
        self.input_to_hidden1_w += rate * np.dot(np.transpose(X), z1_error_delta)
        self.hidden1_to_hidden2_w += rate * np.dot(np.transpose(self.z1), z2_error_delta)
        self.hidden2_to_output_w += rate * np.dot(np.transpose(self.z2), z3_error_delta)

    def train(self, X, y):
        output = self.feed_forward(X)
        self.backward(X, y, rate, output)

    def save_weights(self):
        np.savetxt("w1.txt", self.input_to_hidden1_w, fmt="%s")
        np.savetxt("w2.txt", self.hidden1_to_hidden2_w, fmt="%s")
        np.savetxt("w3.txt", self.hidden2_to_output_w, fmt="%s")

    def check(self, x_test, y_test):
        self.feed_forward(x_test)
        np.mean(np.square((y_test - self.feed_forward(x_test))))

Net = NeuralNetwork()

for l in range(100):
    for i, pattern in enumerate(X):
        for j, outcome in enumerate(y):
            print(f"#: {l}")
            print(f'''
# {str(l)}
# {str(X[i])}
# {str(y[j])}''')
            print(f"Predicted output: {Net.feed_forward(X[i])}")
            Net.train(X[i], y[j])

print(f"Error training: {(np.mean(np.square(y - Net.feed_forward(X))))}")
Net.save_weights()

for i, pattern in enumerate(x_test):
    for j, outcome in enumerate(y_test):
        Net.check(x_test[i], y_test[j])

print(f"Error test: {(np.mean(np.square(y_test - Net.feed_forward(x_test))))}")
I'm working on a project using Python (3.6) and Sklearn. I have done classification before, but when I try to reshape the data in order to use it with sklearn's fit method, it returns an error.
Here's what I have tried:
# Get all the columns from the dataframe
columns = data.columns.tolist()
# Filter the columns to remove data we don't want
columns = [c for c in columns if c not in ["Class"]]
# store the variable we want to predict on
target = "Class"
X = data.drop(target, 1)
Y = data[target]
# Print the shapes of X & Y
print(X.shape)
print(Y.shape)
# define a random state
state = 1
# define the outlier detection methods
classifiers = {
    "Isolation Forest": IsolationForest(max_samples=len(X),
                                        contamination=outlier_fraction,
                                        random_state=state),
    "Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=20,
        contamination=outlier_fraction)
}
# fit the model
n_outliers = len(Fraud)
for i, (clf_name, clf) in enumerate(classifiers.items()):
    # fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_pred = clf.negative_outlier_factor_
    else:
        clf.fit(X)
        scores_pred = clf.decision_function(X)
        y_pred = clf.predict(X)
    # Reshape the prediction values to 0 for valid and 1 for fraudulent
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1
    n_errors = (y_pred != Y).sum()
    # run classification metrics
    print('{}: {}'.format(clf_name, n_errors))
    print(accuracy_score(Y, y_pred))
    print(classification_report(Y, y_pred))
Then it returns the following error:
ValueError: could not convert string to float: '301.48 Change: $0.00'
and it points to the `clf.fit(X)` line.
What have I configured wrong?
We can convert the dataset to numeric values based on the uniqueness of each column's entries, and you can also drop unnecessary columns from the dataset.
Here's how you can try that:
df_full = pd.read_excel('input/samp.xlsx', sheet_name=0,)
df_full = df_full[df_full.filter(regex='^(?!Unnamed)').columns]
df_full.drop(['paymentdetails',], 1, inplace=True)
df_full.drop(['timestamp'], 1, inplace=True)

# Handle non-numeric data
def handle_non_numaric_data(df_full):
    columns = df_full.columns.values
    for column in columns:
        text_digit_vals = {}
        def convert_to_int(val):
            return text_digit_vals[val]
        if df_full[column].dtype != np.int64 and df_full[column].dtype != np.float64:
            column_contents = df_full[column].values.tolist()
            unique_elements = set(column_contents)
            x = 0
            # assign a running integer to each unique value in the column
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    text_digit_vals[unique] = x
                    x += 1
            df_full[column] = list(map(convert_to_int, df_full[column]))
    return df_full
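A short usage sketch (assuming df_full as loaded above; the "Class" column name is taken from the question and may differ in your file):

df_full = handle_non_numaric_data(df_full)
print(df_full.dtypes)            # every column should now be int64 or float64
X = df_full.drop("Class", 1)     # features
Y = df_full["Class"]             # labels
# clf.fit(X) from the question should no longer raise the string-to-float error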
I'm trying to normalize my dataset, which is 1.7 GB. I have 14 GB of RAM and I hit my limit very quickly.
This happens when computing the mean/std of the training data. The training data takes up the majority of the memory when loaded into RAM (13.8 GB), so the mean gets calculated, but when it reaches the next line and calculates the std, it crashes.
Here is the script:
import caffe
import leveldb
import numpy as np
from caffe.proto import caffe_pb2
import cv2
import sys
import time

direct = 'examples/svhn/'
db_train = leveldb.LevelDB(direct+'svhn_train_leveldb')
db_test = leveldb.LevelDB(direct+'svhn_test_leveldb')
datum = caffe_pb2.Datum()

# using the whole dataset for training which is 604,388
size_train = 604388  # normal training set is 73257
size_test = 26032
data_train = np.zeros((size_train, 3, 32, 32))
label_train = np.zeros(size_train, dtype=int)

print 'Reading training data...'
i = -1
for key, value in db_train.RangeIter():
    i = i + 1
    if i % 1000 == 0:
        print i
    if i == size_train:
        break
    datum.ParseFromString(value)
    label = datum.label
    data = caffe.io.datum_to_array(datum)
    data_train[i] = data
    label_train[i] = label

print 'Computing statistics...'
print 'calculating mean...'
mean = np.mean(data_train, axis=(0,2,3))
print 'calculating std...'
std = np.std(data_train, axis=(0,2,3))
#np.savetxt('mean_svhn.txt', mean)
#np.savetxt('std_svhn.txt', std)

print 'Normalizing training'
for i in range(3):
    print i
    data_train[:, i, :, :] = data_train[:, i, :, :] - mean[i]
    data_train[:, i, :, :] = data_train[:, i, :, :]/std[i]

print 'Outputting training data'
leveldb_file = direct + 'svhn_train_leveldb_normalized'
batch_size = size_train

# create the leveldb file
db = leveldb.LevelDB(leveldb_file)
batch = leveldb.WriteBatch()
datum = caffe_pb2.Datum()

for i in range(size_train):
    if i % 1000 == 0:
        print i
    # save in datum
    datum = caffe.io.array_to_datum(data_train[i], label_train[i])
    keystr = '{:0>5d}'.format(i)
    batch.Put(keystr, datum.SerializeToString())
    # write batch
    if (i + 1) % batch_size == 0:
        db.Write(batch, sync=True)
        batch = leveldb.WriteBatch()
        print (i + 1)

# write last batch
if (i + 1) % batch_size != 0:
    db.Write(batch, sync=True)
    print 'last batch'
    print (i + 1)

# explicitly freeing memory to avoid hitting the limit!
#del data_train
#del label_train

print 'Reading test data...'
data_test = np.zeros((size_test, 3, 32, 32))
label_test = np.zeros(size_test, dtype=int)
i = -1
for key, value in db_test.RangeIter():
    i = i + 1
    if i % 1000 == 0:
        print i
    if i == size_test:
        break
    datum.ParseFromString(value)
    label = datum.label
    data = caffe.io.datum_to_array(datum)
    data_test[i] = data
    label_test[i] = label

print 'Normalizing test'
for i in range(3):
    print i
    data_test[:, i, :, :] = data_test[:, i, :, :] - mean[i]
    data_test[:, i, :, :] = data_test[:, i, :, :]/std[i]

# Zero Padding
#print 'Padding...'
#npad = ((0,0), (0,0), (4,4), (4,4))
#data_train = np.pad(data_train, pad_width=npad, mode='constant', constant_values=0)
#data_test = np.pad(data_test, pad_width=npad, mode='constant', constant_values=0)

print 'Outputting test data'
leveldb_file = direct + 'svhn_test_leveldb_normalized'
batch_size = size_test

# create the leveldb file
db = leveldb.LevelDB(leveldb_file)
batch = leveldb.WriteBatch()
datum = caffe_pb2.Datum()

for i in range(size_test):
    # save in datum
    datum = caffe.io.array_to_datum(data_test[i], label_test[i])
    keystr = '{:0>5d}'.format(i)
    batch.Put(keystr, datum.SerializeToString())
    # write batch
    if (i + 1) % batch_size == 0:
        db.Write(batch, sync=True)
        batch = leveldb.WriteBatch()
        print (i + 1)

# write last batch
if (i + 1) % batch_size != 0:
    db.Write(batch, sync=True)
    print 'last batch'
    print (i + 1)
How can I make it consume less memory so that I can get the script to run?
Why not compute the statistics on a subset of the original data? For example, here we compute the mean and std for just 100 points:
sample_size = 100
data_train = np.random.rand(1000, 20, 10, 10)
# Take subset of training data
idxs = np.random.choice(data_train.shape[0], sample_size)
data_train_subset = data_train[idxs]
# Compute stats
mean = np.mean(data_train_subset, axis=(0,2,3))
std = np.std(data_train_subset, axis=(0,2,3))
If your data is 1.7 GB, it is highly unlikely that you need all of it to get an accurate estimate of the mean and std.
In addition, could you get away with fewer bits in your datatype? I'm not sure what datatype caffe.io.datum_to_array returns, but you could do:
data = caffe.io.datum_to_array(datum).astype(np.float32)
to ensure the data is float32 format. (If the data is currently float64, then this will save you half the space).
The culprit that caused so many issues and the constant crashing due to insufficient memory was the batch size being the size of the whole training set:
print 'Outputting test data'
leveldb_file = direct + 'svhn_test_leveldb_normalized'
batch_size = size_test
This apparently was the cause: nothing would get committed and saved to disk until the whole dataset had been read and loaded into one huge transaction, which is also why using np.float32 as suggested by @BillCheatham didn't work properly on its own.
The memory-map solution wouldn't work for me for some reason, so I used the solution I mentioned above.
PS: Later on, I completely switched to float32, fixed the batch_size and ran the whole thing together; that's how I can say my former solution (divide the data into chunks and add the fractions together) works and gives the same number to within 2 decimals.
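For reference, a minimal sketch of that "divide and add the fractions together" idea (my interpretation, not the exact code used): accumulate per-channel sums chunk by chunk so that only one chunk needs to be in memory at a time, then combine them into the mean and std:

import numpy as np

def chunked_mean_std(data_iter, n_channels=3):
    # data_iter yields arrays of shape (chunk, channels, H, W)
    s = np.zeros(n_channels)       # running sum per channel
    sq = np.zeros(n_channels)      # running sum of squares per channel
    count = 0
    for chunk in data_iter:
        chunk = chunk.astype(np.float64)
        s += chunk.sum(axis=(0, 2, 3))
        sq += (chunk ** 2).sum(axis=(0, 2, 3))
        count += chunk.shape[0] * chunk.shape[2] * chunk.shape[3]
    mean = s / count
    std = np.sqrt(sq / count - mean ** 2)   # Var[x] = E[x^2] - (E[x])^2
    return mean, std

With the leveldb reader above, data_iter could be a generator that decodes, say, 10,000 datums at a time instead of materialising the whole training set.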