Slicing data frame for N steps in the future - tensorflow - python

I'm currently tinkering with Tensorflow in my spare time. I came up with a simple data set:
A B C D E
1 2 5 7 9
2 4 10 14 18
3 6 15 21 27
4 8 20 28 36
5 10 25 35 45
6 12 30 42 54
7 14 35 49 63
8 16 40 56 72
9 18 45 63 81
I am slicing the data as follows:
sc = MinMaxScaler(feature_range=(0, 1))
training_set = dataset_train.iloc[:, :].values
scaled_training_set = sc.fit_transform(training_set)

time_step = 3
X_train = []
Y_train = []
for i in range(time_step, len(training_set)):
    X_train.append(scaled_training_set[i - time_step:i, :])
    Y_train.append(scaled_training_set[i:i + 3, 3])
X_train = np.array(X_train)
Y_train = np.array(Y_train)
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], -1))
print(X_train.shape, Y_train.shape)
And my model is as follows:
model = Sequential()
model.add(LSTM(units=50,return_sequences=True, input_shape=(X_train.shape[1],X_train.shape[2])))
model.add(Dropout(0.3))
model.add(LSTM(units=50,return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(units=10,return_sequences=True))
model.add(Dense(units=3))
model.compile(optimizer='adam',loss='mean_squared_error')
model.summary()
model.fit(X_train,Y_train,epochs=50,batch_size=time_step)
However, when I attempt this model, I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-120-ec3d3648bdb9> in <module>()
13 model.compile(optimizer='adam',loss='mean_squared_error')
14 model.summary()
---> 15 model.fit(X_train,Y_train,epochs=50,batch_size=time_step)
13 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py in convert_to_eager_tensor(value, ctx, dtype)
96 dtype = dtypes.as_dtype(dtype).as_datatype_enum
97 ctx.ensure_initialized()
---> 98 return ops.EagerTensor(value, ctx.device_name, dtype)
99
100
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).
I am assuming this is an issue with either (1) how I am slicing the dataframe or (2) how I am tinkering with the hyperparameters.
Any insight would be appreciated.
Edit:
To address the comment below, I padded Y_train[4] and Y_train[5] so that they are also arrays with 3 elements. This still throws the above error.
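For reference, here is a minimal sketch of one way to keep every target window at a fixed length, so that X_train and Y_train come out as regular (non-ragged) arrays; it assumes the intent is simply to drop the windows that would run past the end of the data rather than pad them:

import numpy as np

n_future = 3  # number of future steps to predict
X_train, Y_train = [], []
# Stop early enough that every target window holds exactly n_future values,
# so np.array() builds a regular float array instead of an object array.
for i in range(time_step, len(scaled_training_set) - n_future + 1):
    X_train.append(scaled_training_set[i - time_step:i, :])
    Y_train.append(scaled_training_set[i:i + n_future, 3])
X_train = np.array(X_train)   # (samples, time_step, features)
Y_train = np.array(Y_train)   # (samples, n_future)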

Related

Optimize dataframe fill and refill Python Pandas

I have changed the column names and added new columns too.
I have a NumPy array whose values I need to fill into the corresponding dataframe columns.
Filling the dataframe with the following code is slow:
import pandas as pd
import numpy as np

df = pd.read_csv("sample.csv")
df = df.tail(1000)

DISPLAY_IN_TRAINING = []
Slice_Middle_Piece_X = slice(None, -1, None)
Slice_Middle_Piece_Y = slice(-1, None)
input_slicer = slice(None, None)
output_slice = slice(None, None)
seq_len = 15  # choose sequence length
n_steps = seq_len - 1
Disp_Data = df

def Generate_DataSet(stock,
                     df_clone,
                     seq_len
                     ):
    global DISPLAY_IN_TRAINING
    data_raw = stock.values  # convert to numpy array
    data = []
    len_data_raw = data_raw.shape[0]
    for index in range(0, len_data_raw - seq_len + 1):
        data.append(data_raw[index: index + seq_len])
    data = np.array(data)
    test_set_size = int(np.round(30 / 100 * data.shape[0]))
    train_set_size = data.shape[0] - test_set_size
    x_train, y_train = Get_Data_Chopped(data[:train_set_size])
    print("Training Sliced Successful....!")
    df_train_candle = df_clone[n_steps : train_set_size + n_steps]
    if len(DISPLAY_IN_TRAINING) == 0:
        DISPLAY_IN_TRAINING = list(df_clone)
    df_train_candle.columns = DISPLAY_IN_TRAINING
    return [x_train, y_train, df_train_candle]

def Get_Data_Chopped(data_related_to):
    x_values = []
    y_values = []
    for index, iter_values in enumerate(data_related_to):
        x_values.append(iter_values[Slice_Middle_Piece_X, input_slicer])
        y_values.append([item for sublist in iter_values[Slice_Middle_Piece_Y, output_slice] for item in sublist])
    x_values = np.asarray(x_values)
    y_values = np.asarray(y_values)
    return [x_values, y_values]

x_train, y_train, df_train_candle = Generate_DataSet(df,
                                                     Disp_Data,
                                                     seq_len
                                                     )

df_train_candle.reset_index(drop=True, inplace=True)
df_columns = list(df_train_candle)
df_outputs_name = []
OUTPUT_COLUMN = df.columns
for output_column_name in OUTPUT_COLUMN:
    df_outputs_name.append(output_column_name + "_pred")
    for i in range(len(df_columns)):
        if df_columns[i] == output_column_name:
            df_columns[i] = output_column_name + "_orig"
            break

df_train_candle.columns = df_columns
df_pred_names = pd.DataFrame(columns=df_outputs_name)
df_train_candle = df_train_candle.join(df_pred_names, how="outer")

for row_index, row_value in enumerate(y_train):
    for valueindex, output_label in enumerate(OUTPUT_COLUMN):
        df_train_candle.loc[row_index, output_label + "_orig"] = row_value[valueindex]
        df_train_candle.loc[row_index, output_label + "_pred"] = row_value[valueindex]

print(df_train_candle.head())
The shape of my y_train is (195, 24) and the dataframe shape is (195, 48). I am now trying to optimize the process and make it faster. The y_train may change shape, say to (195, 1) or (195, 5).
So can someone tell me another (optimized) way of doing the above? I want a general solution that fits any shape without losing data integrity and is faster too.
If the data size increases from 1000 to 2000 rows, the process becomes slow. Please advise how to make it faster.
Sample Data df looks like this with shape (1000, 8)
A B C D E F G H
64272 195 215 239 272 22 11 33 55
64273 196 216 240 273 22 11 33 55
64274 197 217 241 274 22 11 33 55
64275 198 218 242 275 22 11 33 55
64276 199 219 243 276 22 11 33 55
The output looks like this:
A_orig B_orig C_orig D_orig E_orig F_orig G_orig H_orig A_pred B_pred C_pred D_pred E_pred F_pred G_pred H_pred
0 10 30 54 87 22 11 33 55 10 30 54 87 22 11 33 55
1 11 31 55 88 22 11 33 55 11 31 55 88 22 11 33 55
2 12 32 56 89 22 11 33 55 12 32 56 89 22 11 33 55
3 13 33 57 90 22 11 33 55 13 33 57 90 22 11 33 55
4 14 34 58 91 22 11 33 55 14 34 58 91 22 11 33 55
Please generate a CSV with 1000 or more rows and you will see that the program becomes slower. I want to make it faster. I hope this is enough to understand the problem.
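Not a definitive answer, but here is a minimal sketch of the kind of vectorised fill that usually replaces the per-cell .loc loop. It assumes, as the loop above does, that the first len(OUTPUT_COLUMN) values of each y_train row correspond to OUTPUT_COLUMN in order, and that the _orig and _pred columns already exist in df_train_candle:

import numpy as np
import pandas as pd

def fill_orig_and_pred(df_train_candle, y_train, output_columns):
    """Write all <col>_orig and <col>_pred columns in two block assignments."""
    y = np.asarray(y_train)[:, :len(output_columns)]   # same values the loop writes
    orig_cols = [c + "_orig" for c in output_columns]
    pred_cols = [c + "_pred" for c in output_columns]
    out = df_train_candle.copy()
    rows = out.index[:len(y)]
    out.loc[rows, orig_cols] = y
    out.loc[rows, pred_cols] = y
    return out

df_train_candle = fill_orig_and_pred(df_train_candle, y_train, OUTPUT_COLUMN)

Replacing rows-times-columns individual .loc writes with two block assignments is typically what removes the slowdown as the data grows.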

ValueError while trying to fit to a sklearn model

I'm trying to build a linear regression model with just one predictor variable for a Retail Dataset. The predictor variable I'm trying to use is known as Description.
The Description column contains numerical values for the product category, and the column's data type was initially int64.
InvoiceNo CustomerID StockCode Quantity Description Country Year Month Day dayofWeek HourofDay
4501 14730.0 3171 2 1324.0 35 2011 3 28 0 12
2220 14442.0 1483 2 368.0 6 2011 1 27 3 9
5736 15145.0 2498 12 2799.0 35 2011 4 26 1 16
3347 12678.0 1809 12 48.0 13 2011 2 28 0 11
14246 14179.0 1510 1 953.0 35 2011 10 18 1 1
When I tried to fit the model with just that, it worked perfectly.
X_train_1 = X_train['Description'].values.reshape(-1, 1)
X_train_1.columns = ['Description']
X_train_1.shape
mod = LinearRegression()
mod.fit(X_train_1, y_train)
However, when I changed the data type of the Description variable from int64 to category, it started throwing this error.
ValueError Traceback (most recent call last)
<ipython-input-48-7243890f1de1> in <module>()
4
5 mod = LinearRegression()
----> 6 mod.fit(X_train_1, y_train)
5 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py in take_nd(arr, indexer, axis, out, fill_value, allow_fill)
1711 arr.ndim, arr.dtype, out.dtype, axis=axis, mask_info=mask_info
1712 )
-> 1713 func(arr, indexer, out, fill_value)
1714
1715 if flip_order:
pandas/_libs/algos_take_helper.pxi in pandas._libs.algos.take_1d_int64_int64()
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
Any idea why this is happening? Any help would be much appreciated.
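Not an explanation of the root cause, but one common workaround (a sketch, using the column names from the question) is to convert the categorical column back to a plain numeric array before reshaping, so that sklearn receives an ordinary 2-D float/int array:

import numpy as np
from sklearn.linear_model import LinearRegression

# Assumes X_train['Description'] now has dtype 'category'; casting back to
# int64 restores the plain numeric values that worked before.
X_train_1 = X_train['Description'].astype('int64').to_numpy().reshape(-1, 1)

mod = LinearRegression()
mod.fit(X_train_1, y_train)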

Python recursive function failing

The issue that I am having is a really strange issue.
What I am trying to accomplish is the following: I am training a neural network using pytorch, and I want to restart my training function if the training loss doesn't decrease, so as to re-initialize the neural network with a different set of weights. The training function is presented below:
def __train__(dp, i, j, net, restarts, epoch=0):
    if net == '2CH': model = TwoChannelCNN().cuda()
    elif net == 'Siam': model = SiameseCNN().cuda()
    elif net == 'Trad': model = TraditionalCNN().cuda()
    ls_fn = torch.nn.MSELoss(reduce=True)
    optim = torch.optim.SGD(model.parameters(), lr=1e-6, momentum=0.9)
    epochs = np.arange(100)
    eloss = []
    for epoch in epochs:
        model.train()
        train_loss = []
        tr_batches = np.array_split(dp.train_set, int(len(dp.train_set)/8))
        for tr_batch in tr_batches:
            if net == '2CH': loaded_batch = dp.__load2CH__(tr_batch)
            elif net == 'Siam': loaded_batch = dp.__loadSiam__(tr_batch)
            elif net == 'Trad': loaded_batch = dp.__load__(tr_batch, i)
            for x_batch, y_batch in loaded_batch:
                x_var, y_var = Variable(x_batch.cuda()), Variable(y_batch.cuda())
                y_pred = torch.clamp(model(x_var), 0, 1)
                loss = ls_fn(y_pred, y_var)
                train_loss.append(abs(loss.item()))
                optim.zero_grad()
                loss.backward()
                optim.step()
        eloss.append(np.mean(train_loss))
        print(epoch, np.mean(train_loss))
        if epoch == 10 and np.mean(train_loss) > 0.2:
            restarts += 1
            print('Number of restarts for client {} and fold {}: {}'.format(i, j, restarts))
            __train__(dp, i, j, net, restarts, epoch=0)
    __plotLoss__(epochs, eloss, 'train', str(i), str(j))
    torch.save(model.state_dict(), "Output/client_{}_fold_{}.pt".format(i, j))
So the restarting based on if epoch == 10 and np.mean(train_loss) > 0.2: works, but only sometimes, which is beyond my comprehension. Here is an example of the output:
0 0.5000133737921715
1 0.4999906486272812
2 0.464298670232296
3 0.2727506290078163
4 0.2628978116512299
5 0.2588871221542358
6 0.25728522151708605
7 0.25630473804473874
8 0.2556223524808884
9 0.25522999209165576
10 0.25467908215522767
Number of restarts for client 5 and fold 1: 3
0 0.10957609283713009
1 0.02840371729924134
2 0.021477583368030594
3 0.017759160268232682
4 0.015173796122947827
5 0.013349939693290782
6 0.011949078906879265
7 0.010810676779671655
8 0.00987362345259362
9 0.009110640348696108
10 0.008239036202623808
11 0.007680381585537574
12 0.007171026876221333
13 0.006765962297888837
14 0.006428168776848068
15 0.006133011780953467
16 0.005819878347673745
17 0.005572605537395361
18 0.00535818950227004
19 0.005159409143814457
20 0.0049763926251294235
21 0.004738794513338235
22 0.004578812885309958
23 0.004428663117960554
24 0.004282198464788351
25 0.004145324644400691
26 0.004018862769889626
27 0.0039044404603504573
28 0.0037960831121495744
29 0.0036947361258523586
30 0.0035982220717533267
31 0.0035018146670104723
32 0.0034150678806059887
33 0.0033372560733512698
34 0.003261332974241583
35 0.00318166259540763
36 0.003108531899014735
37 0.0030385089141125848
38 0.002977990984523103
39 0.0029195284016142937
40 0.002870084639441188
41 0.0028180573325994373
42 0.0027717544270049643
43 0.002719321814503495
44 0.0026704726860933194
45 0.0026204266263459316
46 0.002570544072460258
47 0.0025225681523167224
48 0.0024814611543610746
49 0.0024358948737413116
50 0.002398673941639636
51 0.0023606415423654587
52 0.002330436484101057
53 0.0022891738560574027
54 0.002260655496376241
55 0.002227568955708719
56 0.002191826719741698
57 0.0021609061182290058
58 0.0021279943092100666
59 0.0020966088490456513
60 0.002066195117003474
61 0.0020381672924407895
62 0.002009863329306995
63 0.001986304977759602
64 0.0019564831849032487
65 0.0019351609173580756
66 0.0019077356409993626
67 0.0018875047204855945
68 0.0018617453310780547
69 0.001839518720600381
70 0.001815563331498197
71 0.0017149778925132932
72 0.0016894878409248121
73 0.0016652211918212743
74 0.0016422999463582074
75 0.0016183732903472788
76 0.0015962369183098418
77 0.0015757764620279887
78 0.0015542267022799728
79 0.0015323152910759318
80 0.0014337954093957706
81 0.001410489170542867
82 0.0013871921329466962
83 0.0013641994057461773
84 0.001345829172682187
85 0.001322142209181493
86 0.00130379223035348
87 0.001282231878045458
88 0.001263879886683956
89 0.001243419097817167
90 0.0012279346547037929
91 0.001206978429649382
92 0.0011871445969959496
93 0.001172510546330841
94 0.0011529557384797045
95 0.0011350733004023273
96 0.001118382818282214
97 0.001103347793609089
98 0.0010848538354748599
99 0.0010698940242660911
11 0.2542190085053444
12 0.2538975296020508
Here you can see that the restart at epoch 10 works (this is the 3rd restart). The network then converges, so training should be complete, yet the function restarts AGAIN after the 99th epoch (for an unknown reason) and somehow resumes at epoch 11, which also makes no sense, since I explicitly pass epoch=0 whenever the function starts or restarts. I should also add that SOMETIMES the function completes correctly after epoch 99, once convergence has been achieved, and does not restart.
So my question is, why does this piece of code produce inconsistent results and outcomes? What am I missing here? Thanks in advance for any suggestions.
You are restarting the training by calling __train__ a second time in the case if epoch == 10 and np.mean(train_loss) > 0.2: but you never terminate the first loop.
So, after the second training has converged, the outer loop continues at epoch 11.
What you need is a break statement after the inner call to __train__.
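A sketch of that change inside the question's training loop (only the break is new):

if epoch == 10 and np.mean(train_loss) > 0.2:
    restarts += 1
    print('Number of restarts for client {} and fold {}: {}'.format(i, j, restarts))
    __train__(dp, i, j, net, restarts, epoch=0)
    break  # stop this outer invocation's loop; the restarted call takes over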

TypeError: '<' not supported between instances of 'str' and 'int' while doing PCA for k-means clustering

I am trying to apply Kernel Principal Component Analysis to a dataset without a dependent variable in order to do a cluster analysis with k-means, so that I can learn how to do so. Here is a sample of my dataset (according to the scenario, this is a dataset of a shopping mall, and the mall wants to discover the segments of its customers from the data below):
CustomerID Genre Age Annual Income (k$) Spending Score (1-100)
1 Male 19 15 39
2 Male 21 15 81
3 Female 20 16 6
4 Female 23 16 77
5 Female 31 17 40
6 Female 22 17 76
7 Female 35 18 6
8 Female 23 18 94
9 Male 64 19 3
10 Female 30 19 72
11 Male 67 19 14
First, I omitted the CustomerID column and then encoded the gender column so that I could apply kernel PCA. Here is how I did it:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the mall dataset with pandas
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, 1:5].values
df = pd.DataFrame(X)
#df is in order to visualize the "X" on variable explorer
#Encoding independent categorical variables
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
After executing this code, I got an array of dtype float64. A sample from the array I created is below:
0 1 19 15 39
0 1 21 15 81
1 0 20 16 6
1 0 23 16 77
1 0 31 17 40
1 0 22 17 76
1 0 35 18 6
1 0 23 18 94
0 1 64 19 3
1 0 30 19 72
0 1 67 19 14
And then, I wanted to apply Kernel PCA to get the principal components which I will use at k-means. However, when I try to execute the code below, I get the error "TypeError: '<' not supported between instances of 'str' and 'int'".
# Applying Kernel PCA
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components = 'None', kernel = 'rbf')
X = kpca.fit_transform(X)
explained_variance = kpca.explained_variance_ratio_
Even though I encoded my categorical data and there are no strings left in my dataset, I cannot understand why it gives this error. Is there anyone who could help?
Thank you very much in advance.
n_components = 'None' is the problem; you should not pass a string here. Use:
kpca = KernelPCA(n_components = None, kernel = 'rbf')
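The 'str' and 'int' in the message come from the string 'None' eventually being compared against an integer inside the library, which in Python 3 raises exactly this error:

>>> 'None' < 5
TypeError: '<' not supported between instances of 'str' and 'int'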
I suspect this is what is happening: the error is raised by library code that runs before your own comparisons ever execute. The '<' in the TypeError refers to a comparison in which a string ends up being compared against an int by something outside your own code.

Theano dimensionality error - target dimensions

I am using lasagne's Conv3DDNNLayer, and have input dimensions of (N x 1 x 9 x 9 x 9), where each 9x9x9 cube is my sample to be classified.
Therefore I have a target dimension of (N x 1), with each entry corresponding to a cube. This is raising the error:
Bad input argument to theano function with name "Conv_Net_1.py:45" at index 1 (0-based): 'Wrong number of dimensions: expected 1, got 2 with shape (324640, 1).'
What dimensions should my targets have in this case?
11 dtensor5 = TensorType('float32', (False,)*5)
12 input_var = dtensor5('X_Train')
13 target_var = T.ivector('Y_train')
14
15 X_train, Y_train = DP.data_gen('/home/Upload/Smalls', 9)
16 print X_train.shape
17 print Y_train.shape
18 # Build Neural Network:
19 input = lasagne.layers.InputLayer((None, 1, 9, 9, 9), input_var=input_var)
20
21 l_conv_1 = lasagne.layers.dnn.Conv3DDNNLayer(input, 20, (2,2,2))
22
29 l_hidden1 = lasagne.layers.DenseLayer(l_conv_1, num_units=256, nonlinearity=lasagne.nonlinearities.rectify, W=lasagne.init.HeNormal(gain='relu'))
30
31 l_hidden1_dropout = lasagne.layers.DropoutLayer(l_hidden1, p=0.5)
32
33 output = lasagne.layers.DenseLayer(l_hidden1_dropout, num_units=2, nonlinearity=lasagne.nonlinearities.softmax)
34
35 ##
36 prediction = lasagne.layers.get_output(output)
37 loss = T.mean(lasagne.objectives.categorical_crossentropy(prediction, target_var))
39
40 # Get list of all trainable parameters in the network.
41 params = lasagne.layers.get_all_params(output, trainable=True)
42 updates = lasagne.updates.nesterov_momentum(loss, params, learning_rate=0.01, momentum=0.3)
43
44 ##
45 train_fn = theano.function([input_var, target_var], loss, updates=updates)
46
47 ##
48 for epoch in range(500):
49     print('training')
50     loss = train_fn(X_train, Y_train)
51     print(loss.type)
52     print("Epoch %d: Loss %g" % (epoch + 1, loss))
53
54
55 ##
56 test_prediction = lasagne.layers.get_output(output, deterministic=True)
57 predict_fn = theano.function([input_var], T.argmax(test_prediction, axis=1))
edit - added code
Thanks!
In case anyone is interested, it was because the data was shaped (N, 1) rather than (N,).
Flattening it seemed to solve the problem! On to the next one..
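For anyone hitting the same thing, a minimal sketch of the reshaping that matches the T.ivector target (assuming Y_train arrives as an (N, 1) array):

import numpy as np

# T.ivector expects a 1-D int32 vector, so flatten the (N, 1) targets.
Y_train = np.asarray(Y_train).ravel().astype('int32')
print(Y_train.shape)  # now (N,) instead of (N, 1)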
