Spark transformations and preservation of RDD element ordering - python

I'm trying to understand how (and if) the piece of code below works. In particular, I don't understand why this code assumes, maybe correctly, that the order of elements in the RDD is preserved across map transformations. This is in essence the same question as the one asked in "Mind blown: RDD.zip() method". I don't understand why or how the last line guarantees that zip actually pairs each prediction with the corresponding label from the testData RDD. One of the comments mentions that if the RDD, testData in this case, is ordered in some way, then map will preserve that order. However, predictions is an entirely different RDD, so I can't see how or why this works!
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
## Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = labeledDataRDD.randomSplit([0.7, 0.3])
## Train a RandomForest model
model = RandomForest.trainClassifier(trainingData, numClasses=2510,
                                     categoricalFeaturesInfo={}, numTrees=100,
                                     featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)
# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
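
One way to see why the pairing can line up: both sides of the zip are derived from testData by map-style transformations only, and map works element by element inside each partition without moving anything between partitions. The sketch below is a toy model in plain Python, not Spark's implementation. The helper names rdd_map, rdd_zip and predict are made up for illustration, and it assumes that model.predict on an RDD behaves like an element-wise map.

```python
# Toy model of an RDD as a list of partitions (NOT Spark's implementation).
def rdd_map(partitions, fn):
    # map is applied element by element inside each partition, so partition
    # boundaries and the order within each partition stay unchanged.
    return [[fn(x) for x in part] for part in partitions]


def rdd_zip(left, right):
    # zip pairs the i-th element of partition p on the left with the i-th
    # element of partition p on the right, so the shapes must match.
    if [len(p) for p in left] != [len(p) for p in right]:
        raise ValueError("partitions differ in number or size")
    return [list(zip(lp, rp)) for lp, rp in zip(left, right)]


def predict(features):
    # Hypothetical stand-in for a trained model's per-element prediction.
    return float(sum(features) > 5)


# Hypothetical test data: (label, features) pairs spread over two partitions.
test_data = [[(0.0, [1.0, 2.0]), (1.0, [3.0, 4.0])], [(1.0, [5.0, 6.0])]]

labels = rdd_map(test_data, lambda lp: lp[0])
predictions = rdd_map(rdd_map(test_data, lambda lp: lp[1]), predict)
labels_and_predictions = rdd_zip(labels, predictions)
print(labels_and_predictions)
```

If either side had gone through a step that changed the number of partitions or the number of elements in a partition, the toy rdd_zip would raise. That mirrors the documented requirement of RDD.zip that both RDDs have the same number of partitions and the same number of elements in each partition.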

Related

Why am I getting perfect on my decision tree ML algorithm training?

I'm testing out a Decision Tree for the first time and am getting a perfect score for my algorithm's performance. This doesn't make sense, because the dataset I am using is AAPL stock price data with several variables, which the algorithm obviously shouldn't be able to predict perfectly.
CSV:
Date,Open,High,Low,Close,Adj Close,Volume
2010-01-04,10430.6904296875,10604.9697265625,10430.6904296875,10583.9599609375,10583.9599609375,179780000
2010-01-05,10584.5595703125,10584.5595703125,10522.51953125,10572.01953125,10572.01953125,188540000
I think the reason it might not be working is because I am essentially just feeding in the answers when training the model and it is just regurgitating those when I try and score the model.
Code:
import pandas as pd
from sklearn import preprocessing, tree
# Data Sorting
df = pd.read_csv('AAPL_test.csv')
df = df.drop('Date', axis=1)
df = df.dropna(axis='rows')
inputs = df.drop('Close', axis='columns')
target = df['Close']
print(inputs.dtypes)
print(target.dtypes)
# Changing dtypes
lab_enc = preprocessing.LabelEncoder()
target_encoded = lab_enc.fit_transform(target)
# Model
model = tree.DecisionTreeClassifier()
model.fit(inputs, target_encoded)
print(f'SCORE = {model.score(inputs, target_encoded)}')
I've also thought about randomizing the order of the rows in the CSV file; that could help, but I'm not sure how I would do it. I could shuffle df at the top of the code, but I'm pretty sure that would affect both dataframes equally, so it would make no difference to what I am doing now. Otherwise, I could randomize the datasets individually, but I think that would mess with the model training or scoring because the target values would no longer line up with the right input rows? I'm not too sure.
Most probably your model is overfitting. You score the model on the same data you trained it on, because you did not split your dataset into two parts: one for training and one for testing. Scoring on held-out test data will help you see whether your model overfits or underfits.
For more information:
Overfitting
How to Prevent Overfitting
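
A minimal sketch of that split, assuming scikit-learn is available. The data here is synthetic and hypothetical (it is not the AAPL file), and the variable names are chosen only for illustration. It shows that an unrestricted tree scores perfectly on the rows it was fitted on, while the score on held-out rows is lower.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical): 3 features, noisy binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)

# Hold out 30% of the rows; the model never sees them during fit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)  # scored on rows it has memorized
test_score = model.score(X_test, y_test)     # scored on unseen rows
print(f'TRAIN SCORE = {train_score}')
print(f'TEST SCORE = {test_score}')
```

For time-ordered data such as daily prices, passing shuffle=False to train_test_split keeps the test rows strictly later than the training rows.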

CatBoost Post-Training Feature Information

I would like to understand how I can access information about numerical and categorical features after training a CatBoost model. For the sake of example, here's some toy code:
import pandas as pd
from catboost import CatBoostClassifier, Pool
train_pool = Pool(pd.DataFrame({'size': [1, 1, 2, 1],
                                'shape': ['square', 'square', 'square', 'circle']}),
                  [1, 1, 0, 1],
                  feature_names=['size', 'shape'],
                  cat_features=['shape'])
model = CatBoostClassifier(iterations=2,
                           cat_features=['shape'],
                           ctr_leaf_count_limit=1)
model.fit(train_pool, plot=False)
I would now like to run a function on the model object to obtain the following:
Numerical Feature size has minimum value 0, and max value 1 (this should be part of CatBoost's split logic for numerical features)
Categorical Feature shape has the following training values:
values=['square', None].
Notice that circle is not in values because ctr_leaf_count_limit=1 would have selected the most frequent value, which in this case is 'square'. I've put None here because I'm pretty sure CatBoost will assign None to any unseen classes.
Next, I've chosen the above data example to make sure that CatBoost decides to split on shape=='square'. Ideally I'd like to see an array used_values=['square'] which emphasizes that there was at least one split on this square value.
It's important to emphasize here that I want to operate on the model object only. Obviously, one can get some of these details by running functions on top of the training data. My motivation is to make doubly sure that I completely understand the training range of inputs into the model, and what it may do to them in preprocessing.

Why is sklearn.metrics support value changing every time?

I'm working on training a supervised learning keras model to categorize data into one of 3 categories. After training, I run this:
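import numpy as np
import pandas
import sklearn.metrics
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.utils import to_categorical
dataset = pandas.read_csv(filename, header=[0], encoding='utf-8-sig', sep=',')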
# split X and Y (last column)
array = dataset.values
columns = array.shape[1] - 1
np.random.shuffle(array)
x_orig = array[:, 1:columns]
testy = array[:, columns]
columns -= 1
# normalize data
scaler = StandardScaler()
testx= scaler.fit_transform(x_orig)
#onehot
testy = to_categorical(testy)
# load weights
save_path = "[filepath]"
model = tf.keras.models.load_model(save_path)
# gets class breakdown
y_pred = model.predict(testx, verbose=1)
y_pred_bool = np.argmax(y_pred, axis=1)
y_true = np.argmax(testy, axis=1)
print(sklearn.metrics.precision_recall_fscore_support(y_true, y_pred_bool))
sklearn.metrics.precision_recall_fscore_support prints, among other metrics, the support for each class. Per this link, support is the number of occurrences of each class in y_true, which is the true labels.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
My problem: each run, support is different. I'm using the same data, and support for each class always adds up the same (but different than the total in the file – which I also don’t understand), but the number per class differs.
As an example, one run might say [16870, 16299, 7807] and the next might say [17169, 15923, 7884]. They add up the same, but each class differs.
Since my data isn't changing between runs, I'd expect support to be identical every time. Am I wrong? If not, what's going on? I've tried googling, but didn't get any useful results.
Potentially useful information: when I run sklearn.metrics.classification_report, I have the same issue, and the numbers from that match the numbers from precision_recall_fscore_support.
Sidenote: unrelated to above question, but I couldn't google-fu an answer to this one either, I hope that's ok to include here. When I run model.evaluate, part of the printout is e.g. 74us/sample. What does us/sample mean?
Add:
np.random.seed(42)
before you shuffle the array at
np.random.shuffle(array)
The reason is that, without seeding, np.random.shuffle produces a different order each time. So when you feed the array into the model, the results differ from run to run. Seeding makes the shuffle identical each time, which gives reproducible results.
Alternatively, skip the shuffle and you get the same array each time to feed into the model. Either approach, or both, will ensure reproducibility.
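
A small sketch of the seeding behaviour, using a made-up array and a hypothetical helper named seeded_shuffle. It checks that two shuffles with the same seed produce the same row order. It also counts the labels before and after, since shuffling whole rows only reorders them.

```python
import numpy as np

# Hypothetical data: column 0 is a feature, column 1 is the class label.
data = np.column_stack([np.arange(10), np.arange(10) % 3])


def seeded_shuffle(rows, seed=42):
    # Reset the global NumPy generator, then shuffle a copy of the rows.
    np.random.seed(seed)
    out = rows.copy()
    np.random.shuffle(out)  # shuffles along the first axis (whole rows)
    return out


first = seeded_shuffle(data)
second = seeded_shuffle(data)
same_order = np.array_equal(first, second)

# Per-class counts of the label column before and after shuffling.
counts_before = np.bincount(data[:, 1])
counts_after = np.bincount(first[:, 1])
print(same_order, counts_before, counts_after)
```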

Passing the argument from a previous step in sklearn pipelines

I have a dataset with numerical and categorical features on which I am trying to fit a classifier. My idea was to preprocess the categorical data first using Pandas such that my dataset can be written as (to borrow MATLAB's concatenation notation)
X_train = [ X_train_num, X_train_cat ]
and
X_test = [ X_test_num, X_test_cat ].
To deal with numerical data, I did the following:
# define concatenation of arrays so we can assemble the various parts
# that are preprocessed differently in the pipelines
def concat(a1, a2):
    return np.concatenate((a1, a2), axis=1)
# pipeline to preprocess, reassemble, and fit our models
trainPipeline = Pipeline([
    ('preprocessing', numPipeline),  # scale numerical data
    ('assembling', FunctionTransformer(concat, kw_args={'a2': X_train[nominalFeatures]})),  # wrong, but how?
    ('classifying', LogisticRegression())
])
The issue here is that when I pass X_train to the pipeline, it only extracts X_train_num to scale it in the first step, which is why I need to reassemble X_train_num_scaled with X_train_cat = X_train[nominalFeatures] together in the second step. The code above will obviously not work when I use X_test as an input for prediction unless I find a way to access the initial input from the first step and use that in the concatenation step.
I have tried to look at trainPipeline.steps[0] and down the list for the initial variable name but found nothing that could help me. What am I missing?
As @Vivek Kumar states, you should use FeatureUnion to construct that pipeline. It is usually used to concatenate the outputs of several transformers so the model trains on the combined features. So, in your case the pipeline should look like the following:
from sklearn.pipeline import FeatureUnion, Pipeline
def select_nominal(X):
    # pass the already-preprocessed categorical columns through unchanged
    return X[nominalFeatures]
union = FeatureUnion(
    [('prep_data', numPipeline),  # extracts and scales the numerical columns
     ('raw_data', FunctionTransformer(select_nominal))])
pipe = Pipeline(
    [('union', union),
     ('logreg', LogisticRegression())])
Then, after pipe.fit(X_train, y_train), you should be able to call pipe.predict(X_test), since each branch picks its own columns from whatever input it receives.
Quick check: numPipeline is applied to the numerical columns and X[nominalFeatures] is left as it is. I hope that is what you desire.
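
For comparison, here is a sketch of the same idea using ColumnTransformer, which is a different scikit-learn tool from the FeatureUnion suggested above. The column names and the tiny data frames are hypothetical, and StandardScaler stands in for whatever numPipeline does. Because the transformer selects columns from its input at both fit and predict time, no training data is baked into the pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two numerical columns, one already-encoded categorical.
X_train = pd.DataFrame({'height': [1.0, 2.0, 3.0, 4.0],
                        'weight': [10.0, 20.0, 15.0, 30.0],
                        'is_red': [0, 1, 0, 1]})
y_train = [0, 1, 0, 1]
X_test = pd.DataFrame({'height': [1.5, 3.5],
                       'weight': [12.0, 28.0],
                       'is_red': [0, 1]})

numericalFeatures = ['height', 'weight']
nominalFeatures = ['is_red']

# Scale the numerical columns, pass the categorical ones through untouched,
# and concatenate the results column-wise.
preprocess = ColumnTransformer([
    ('num', StandardScaler(), numericalFeatures),
    ('cat', 'passthrough', nominalFeatures),
])

pipe = Pipeline([
    ('preprocessing', preprocess),
    ('classifying', LogisticRegression()),
])
pipe.fit(X_train, y_train)

predictions = pipe.predict(X_test)
transformed = pipe.named_steps['preprocessing'].transform(X_test)
print(predictions, transformed.shape)
```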

Training a sklearn LogisticRegression classifier without all possible labels

I am trying to use scikit-learn 0.12.1 to:
- train a LogisticRegression classifier
- evaluate the classifier on held-out validation data
- feed new data to this classifier and retrieve the 5 most probable labels for each observation
Sklearn makes all of this very easy except for one peculiarity. There is no guarantee that every possible label will occur in the data used to fit my classifier. There are hundreds of possible labels and some of them have not occurred in the training data available.
This results in 2 problems:
1. The label vectorizer doesn't recognize previously unseen labels when they occur in the validation data. This is easily fixed by fitting the labeler to the set of possible labels but it exacerbates problem 2.
2. The output of the predict_proba method of the LogisticRegression classifier is an [n_samples, n_classes] array, where n_classes consists only of the classes seen in the training data. This means running argsort on the predict_proba array no longer provides values that directly map to the label vectorizer's vocabulary.
My question is, what's the best way to force the classifier to recognize the full set of possible classes, even when some of them don't occur in the training data? Obviously it will have trouble learning about labels it has never seen data for, but 0's are perfectly useable in my situation.
Here's a workaround. Make sure you have a list of all classes called all_classes. Then, if clf is your LogisticRegression classifier,
from itertools import repeat
# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(clf.classes_).symmetric_difference(all_classes)
# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
for row in prob:
    prob_per_class = (list(zip(clf.classes_, row))
                      + list(zip(classes_not_trained, repeat(0.))))
produces, for each row, a list of (cls, prob) pairs.
If what you want is an array like that returned by predict_proba, but with columns corresponding to sorted all_classes, how about:
import numpy
all_classes = numpy.array(sorted(all_classes))
# Get the probabilities for learnt classes
prob = clf.predict_proba(test_samples)
# Create the result matrix, where all values are initially zero
new_prob = numpy.zeros((prob.shape[0], all_classes.size))
# Set the columns corresponding to clf.classes_
new_prob[:, all_classes.searchsorted(clf.classes_)] = prob
Building on larsman's excellent answer, I ended up with this:
from itertools import repeat
import numpy as np
# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(clf.classes_).symmetric_difference(all_classes)
# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
new_prob = []
for row in prob:
    prob_per_class = list(zip(clf.classes_, row)) + list(zip(classes_not_trained, repeat(0.)))
    # put the probabilities in class order
    prob_per_class = sorted(prob_per_class)
    new_prob.append([i[1] for i in prob_per_class])
new_prob = np.asarray(new_prob)
new_prob is an [n_samples, n_classes] array just like the output from predict_proba, except now it includes 0 probabilities for the previously unseen classes.
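
To connect this back to retrieving the most probable labels, here is a self-contained sketch of the searchsorted approach followed by a top-k lookup. The class names, the trained_classes array (a stand-in for clf.classes_) and the prob array (a stand-in for the predict_proba output) are all hypothetical, so no fitted classifier is needed to run it.

```python
import numpy as np

# Hypothetical full label set, sorted so searchsorted can be used on it.
all_classes = np.array(sorted(['cat', 'dog', 'emu', 'fox']))
# Stand-in for clf.classes_: only these labels occurred in training.
trained_classes = np.array(['cat', 'emu'])
# Stand-in for clf.predict_proba(test_samples): one column per trained class.
prob = np.array([[0.9, 0.1],
                 [0.2, 0.8]])

# Expand to one column per possible class; unseen classes keep probability 0.
full_prob = np.zeros((prob.shape[0], all_classes.size))
full_prob[:, np.searchsorted(all_classes, trained_classes)] = prob

# Column indices of the k largest probabilities per row, best first.
k = 2
top_idx = np.argsort(-full_prob, axis=1)[:, :k]
top_labels = all_classes[top_idx]
print(top_labels)
```

Because the columns of full_prob follow all_classes, the argsort indices map directly onto the full label vocabulary.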
