I'm testing out a Decision Tree for the first time and am getting a perfect score for my algorithm's performance. This doesn't make sense because the dataset that I am using is AAPL stock price for a bunch of different variables which obviously the algorithm can't detect perfectly.
CSV:
Date,Open,High,Low,Close,Adj Close,Volume
2010-01-04,10430.6904296875,10604.9697265625,10430.6904296875,10583.9599609375,10583.9599609375,179780000
2010-01-05,10584.5595703125,10584.5595703125,10522.51953125,10572.01953125,10572.01953125,188540000
I think the reason it might not be working is because I am essentially just feeding in the answers when training the model and it is just regurgitating those when I try and score the model.
Code:
# Data Sorting
df = pd.read_csv('AAPL_test.csv')
df = df.drop('Date', axis=1)
df = df.dropna(axis='rows')
inputs = df.drop('Close', axis='columns')
target = df['Close']
print(inputs.dtypes)
print(target.dtypes)
# Changing dtypes
lab_enc = preprocessing.LabelEncoder()
target_encoded = lab_enc.fit_transform(target)
# Model
model = tree.DecisionTreeClassifier()
model.fit(inputs, target_encoded)
print(f'SCORE = {model.score(inputs, target_encoded)}')
I've also thought about randomizing the order of the CSV files, that could help but I'm not sure how I would do that. I could randomize the df at the top of the code but I'm pretty sure that, that would equally skew the results for both dataframes and therefore there would be no difference to what I am doing now. Otherwise, I could individually randmoize the datasets but I think that would mess with the model training or scoring because the test data won't be associated with the right data? I'm not too sure.
Most probably your model is overfitted. I think you did not split your dataset into two part: One is for training and the other is testing. Test data will help you to understand if your model overfit or underfit.
For more information:
Overfitting
How to Prevent Overfitting
The idea of this model is that it learns, through neural networks, to perform the multiplication of two feactures, so I created a training dataset with multiplications of random numbers from 0 to 100. As the idea is that it learns to multiply in any situation, I created training data a) with random numbers up to 100 and b) with random numbers from 1000 to 5000.
I created the neural network below for this, however it does not fit well with the test data “b”.
model = tf.keras.Sequenenter code heretial()
model.add(tf.keras.layers.Dense(units = 2,input_dim = 2))
model.add(tf.keras.layers.Dropout(0.1))
model.add(tf.keras.layers.Dense(units = 64,activation='relu'))
model.add(tf.keras.layers.Dropout(0.1))
model.add(tf.keras.layers.Dense(units = 32,activation='relu'))
model.add(tf.keras.layers.Dense(units = 1))
model.compile(optimizer='adam', loss = 'mean_squared_error')
Compared to the "a" test data, the prediction makes sense. But comparing with the test data "b", it presents a similar curve, but with very distant values.
Data test x Data predict "a"
Data test x Data predict "b"
Data predict "b"
If you want to see my complete code:
https://colab.research.google.com/drive/1rdAhZnHlxyXHHDF2D_grog05oDwYbXHa?usp=sharing
Could you help me with my model to generalize well to data much larger than training data?
Thanks!
Using the scaling provided by you in the comments of your notebook results in a different scaling for training and test data. For example if there is a 100 value in your training data its normalized value should be the same in your test data, which is not the case right now. The easiest way to normalize the data in your case is simply doing it from the beginning, e.g. here
df = pd.DataFrame(data=a, columns=['a'])
df['b']= b
df['mult'] = df['a']*df['b']
# Scale your data here
In any case, I am not sure if this would solve the problem.
I am attempting to apply data augmentation to a TFRecord dataset after it has been parsed. However, when I check the size of the dataset before and after mapping the augmentation function, the sizes are the same. I know the parse function is working and the datasets are correct as I have already used them to train a model. So I have only included code to map the function and count the examples afterward.
Here is the code I am using:
num_ex = 0
def flip_example(image, label):
flipped_image = flip(image)
return flipped_image, label
dataset = tf.data.TFRecordDataset('train.TFRecord').map(parse_function)
for x in dataset:
num_ex += 1
num_ex = 0
dataset = dataset.map(flip_example)
#Size of dataset
for x in dataset:
num_ex += 1
In both cases, num_ex = 324 instead of the expected 324 for non-augmented and 648 for augmented. I have also successfully tested the flip function so it seems the issue is with how the function interacts with the dataset. How do I correctly implement this augmentation?
When you apply data augmentation with the tf.data API, it is done on-the-fly, meaning that every example is transformed as implemented in your method. Augmenting data this way does not mean that the number of examples in your pipeline changes.
If you want to use every example n times, simply add dataset = dataset.repeat(count=n). You might want to update your code to use tf.image.random_flip_left_right, otherwise the flip is done the same way each time.
In your example the second time you check num_ex, dataset only contains the flipped images so 324.
Furthermore if you have a large dataset, larger than 324, you might want to look into online data augmentation. In this case, during training the dataset is augmented differently every epoch, and you only train on the augmented data not on the original dataset. This helps the trained model generalise better. (https://www.tensorflow.org/tutorials/images/data_augmentation)
I have scraped a lot of ebay titles like this one:
Apple iPhone 5 White 16GB Dual-Core
and I have manually tagged all of them in this way
B M C S NA
where B=Brand (Apple) M=Model (iPhone 5) C=Color (White) S=Size (Size) NA=Not Assigned (Dual Core)
Now I need to train a SVM classifier using the libsvm library in python to learn the sequence patterns that occur in the ebay titles.
I need to extract new value for that attributes (Brand, Model, Color, Size) by considering the problem as a classification one. In this way I can predict new models.
I want to considering this features:
* Position
- from the beginning of the title
- to the end of the listing
* Orthographic features
- current word contains a digit
- current word is capitalized
....
I can't understand how can I give all this info to the library. The official doc lacks a lot of information
My class are Brand, Model, Size, Color, NA
what does the input file of the SVM algo must contain?
how can I create it? could I have an example of that file considering the 4 features that I put as example in my question? Can I also have an example of the code that I must use to elaborate the input file ?
* UPDATE *
I want to represent these features... How can I must do?
Identity of the current word
I think that I can interpret it in this way
0 --> Brand
1 --> Model
2 --> Color
3 --> Size
4 --> NA
If I know that the word is a Brand I will set that variable to 1 (true).
It is ok to do it in the training test (because I have tagged all the words) but how can I do that for the test set? I don't know what is the category of a word (this is why I'm learning it :D).
N-gram substring features of current word (N=4,5,6)
No Idea, what does it means?
Identity of 2 words before the current word.
How can I model this feature?
Considering the legend that I create for the 1st feature I have 5^(5) combination)
00 10 20 30 40
01 11 21 31 41
02 12 22 32 42
03 13 23 33 43
04 14 24 34 44
How can I convert it to a format that the libsvm (or scikit-learn) can understand?
Membership to the 4 dictionaries of attributes
Again how can I do it?
Having 4 dictionaries (for color, size, model and brand) I thing that I must create a bool variable that I will set to true if and only if I have a match of the current word in one of the 4 dictionaries.
Exclusive membership to dictionary of brand names
I think that like in the 4. feature I must use a bool variable. Do you agree?
Here's a step-by-step guide for how to train an SVM using your data and then evaluate using the same dataset. It's also available at http://nbviewer.ipython.org/gist/anonymous/2cf3b993aab10bf26d5f. At the url you can also see the output of the intermediate data and the resulting accuracy (it's an iPython notebook)
Step 0: Install dependencies
You need to install the following libraries:
pandas
scikit-learn
From command line:
pip install pandas
pip install scikit-learn
Step 1: Load the data
We will use pandas to load our data.
pandas is a library for easily loading data. For illustration, we first save
sample data to a csv and then load it.
We will train the SVM with train.csv and get test labels with test.csv
import pandas as pd
train_data_contents = """
class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
N,12,0,0,1"""
with open('train.csv', 'w') as output:
output.write(train_data_contents)
train_dataframe = pd.read_csv('train.csv')
Step 2: Process the data
We will convert our dataframe into numpy arrays which is a format that scikit-
learn understands.
We need to convert the labels "B", "M", "C",... to numbers also because svm does
not understand strings.
Then we will train a linear svm with the data
import numpy as np
train_labels = train_dataframe.class_label
labels = list(set(train_labels))
train_labels = np.array([labels.index(x) for x in train_labels])
train_features = train_dataframe.iloc[:,1:]
train_features = np.array(train_features)
print "train labels: "
print train_labels
print
print "train features:"
print train_features
We see here that the length of train_labels (5) exactly matches how many rows
we have in trainfeatures. Each item in train_labels corresponds to a row.
Step 3: Train the SVM
from sklearn import svm
classifier = svm.SVC()
classifier.fit(train_features, train_labels)
Step 4: Evaluate the SVM on some testing data
test_data_contents = """
class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
N,12,0,0,1
"""
with open('test.csv', 'w') as output:
output.write(test_data_contents)
test_dataframe = pd.read_csv('test.csv')
test_labels = test_dataframe.class_label
labels = list(set(test_labels))
test_labels = np.array([labels.index(x) for x in test_labels])
test_features = test_dataframe.iloc[:,1:]
test_features = np.array(test_features)
results = classifier.predict(test_features)
num_correct = (results == test_labels).sum()
recall = num_correct / len(test_labels)
print "model accuracy (%): ", recall * 100, "%"
Links & Tips
Example code for how to load LinearSVC: http://scikitlearn.org/stable/modules/svm.html#svm
Long list of scikit-learn examples: http://scikitlearn.org/stable/auto_examples/index.html. I've found these mildly helpful but
often confusing myself.
If you find that the SVM is taking a long time to train, try LinearSVC
instead: http://scikitlearn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
Here's another tutorial on getting familiar with machine learning models: http://scikit-learn.org/stable/tutorial/basic/tutorial.html
You should be able to take this code and replace train.csv with your training data, test.csv with your testing data, and get predictions for your test data, along with accuracy results.
Note that since you're evaluating using the data you trained on the accuracy will be unusually high.
I echo the comment of #MarcoPashkov but will try to elaborate on the LibSVM file format. I find the documentation comprehensive yet hard to find, for the Python lib I recommend the README on GitHub.
An important piece to recognize is that there is a Sparse format where all features which are 0 get removed and a Dense format where features which are 0 are not removed. These two are equivalent examples of each taken from the README.
# Dense data
>>> y, x = [1,-1], [[1,0,1], [-1,0,-1]]
# Sparse data
>>> y, x = [1,-1], [{1:1, 3:1}, {1:-1,3:-1}]
The y variable stores a list of all the categories for the data.
The x variable stores the feature vector.
assert len(y) == len(x), "Both lists should be the same length"
The format found in the Heart Scale Example is a Sparse format where the dictionary key is the feature index and the dictionary value is the feature value while the first value is the category.
The Sparse format is incredibly useful while using a Bag of Words Representation for your feature vector.
As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).
For instance a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.
For an example using the feature vector you started with, I trained a basic LibSVM 3.20 model. This code isn't meant to be used but may help in showing how to create and test a model.
from collections import namedtuple
# Using namedtuples for descriptive purposes, in actual code a normal tuple would work fine.
Category = namedtuple("Category", ["index", "name"])
Feature = namedtuple("Feature", ["category_index", "distance_from_beginning", "distance_from_end", "contains_digit", "capitalized"])
# Separate up the set of categories, libsvm requires a numerical index so we associate each with an index.
categories = dict()
for index, name in enumerate("B M C S NA".split(' ')):
# LibSVM expects index to start at 1, not 0.
categories[name] = Category(index + 1, name)
categories
Out[0]: {'B': Category(index=1, name='B'),
'C': Category(index=3, name='C'),
'M': Category(index=2, name='M'),
'NA': Category(index=5, name='NA'),
'S': Category(index=4, name='S')}
# Faked set of CSV input for example purposes.
csv_input_lines = """category_index,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
NA,12,0,0,1""".split("\n")
# We just ignore the header.
header = csv_input_lines[0]
# A list of Feature namedtuples, this will be trained as lists.
features = list()
for line in csv_input_lines[1:]:
split_values = line.split(',')
# Create a Feature with the values converted to integers.
features.append(Feature(categories[split_values[0]].index, *map(int, split_values[1:])))
features
Out[1]: [Feature(category_index=1, distance_from_beginning=1, distance_from_end=10, contains_digit=1, capitalized=0),
Feature(category_index=2, distance_from_beginning=10, distance_from_end=1, contains_digit=0, capitalized=1),
Feature(category_index=3, distance_from_beginning=2, distance_from_end=3, contains_digit=0, capitalized=1),
Feature(category_index=4, distance_from_beginning=23, distance_from_end=2, contains_digit=0, capitalized=0),
Feature(category_index=5, distance_from_beginning=12, distance_from_end=0, contains_digit=0, capitalized=1)]
# Y is the category index used in training for each Feature. Now it is an array (order important) of all the trained indexes.
y = map(lambda f: f.category_index, features)
# X is the feature vector, for this we convert all the named tuple's values except the category which is at index 0.
x = map(lambda f: list(f)[1:], features)
from svmutil import svm_parameter, svm_problem, svm_train, svm_predict
# Barebones defaults for SVM
param = svm_parameter()
# The (Y,X) parameters should be the train dataset.
prob = svm_problem(y, x)
model=svm_train(prob, param)
# For actual accuracy checking, the (Y,X) parameters should be the test dataset.
p_labels, p_acc, p_vals = svm_predict(y, x, model)
Out[3]: Accuracy = 100% (5/5) (classification)
I hope this example helps, it shouldn't be used for your training. It is meant as an example only because it is inefficient.
I am trying to use scikit-learn 0.12.1 to:
train a LogisticRegression classifier
evaluate the classifer on held out validation data
feed new data to this classifier and retrieve the 5 most probable labels for each observation
Sklearn makes all of this very easy except for one peculiarity. There is no guarantee that every possible label will occur in the data used to fit my classifier. There are hundreds of possible labels and some of them have not occurred in the training data available.
This results in 2 problems:
The label vectorizer doesn't recognize previously unseen labels when they occur in the validation data. This is easily fixed by fitting the labeler to the set of possible labels but it exacerbates problem 2.
The output of the predict_proba method of the LogisticRegression classifier is an [n_samples, n_classes] array, where n_classes consists only of the classes seen in the training data. This means running argsort on the predict_proba array no longer provides values that directly map to the label vectorizer's vocabulary.
My question is, what's the best way to force the classifier to recognize the full set of possible classes, even when some of them don't occur in the training data? Obviously it will have trouble learning about labels it has never seen data for, but 0's are perfectly useable in my situation.
Here's a workaround. Make sure you have a list of all classes called all_classes. Then, if clf is your LogisticRegression classifier,
from itertools import repeat
# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(clf.classes_).symmetric_difference(all_classes)
# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
for row in prob:
prob_per_class = (zip(clf.classes_, prob)
+ zip(classes_not_trained, repeat(0.)))
produces a list of (cls, prob) pairs.
If what you want is an array like that returned by predict_proba, but with columns corresponding to sorted all_classes, how about:
all_classes = numpy.array(sorted(all_classes))
# Get the probabilities for learnt classes
prob = clf.predict_proba(test_samples)
# Create the result matrix, where all values are initially zero
new_prob = numpy.zeros((prob.shape[0], all_classes.size))
# Set the columns corresponding to clf.classes_
new_prob[:, all_classes.searchsorted(clf.classes_)] = prob
Building on larsman's excellent answer, I ended up with this:
from itertools import repeat
import numpy as np
# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(clf.classes_).symmetric_difference(all_classes)
# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
new_prob = []
for row in prob:
prob_per_class = zip(clf.classes_, prob) + zip(classes_not_trained, repeat(0.))
# put the probabilities in class order
prob_per_class = sorted(prob_per_class)
new_prob.append(i[1] for i in prob_per_class)
new_prob = np.asarray(new_prob)
new_prob is an [n_samples, n_classes] array just like the output from predict_proba, except now it includes 0 probabilities for the previously unseen classes.