So the question revolves around character segmentation. My problem is the following:
I want to segment characters based on y-axis pixel counts, following this approach (in Python): source
What I have already done to get here (a rough sketch follows the list):
read the image with io.imread
swap the axes with np.swapaxes
sum the pixel values of each column (now row) -> the y array
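A minimal sketch of those steps (the filename and the grayscale assumption are mine; the 1275 threshold is quoted below):
import numpy as np
from skimage import io

img = io.imread('alma.png')         # hypothetical filename, assumed grayscale
cols = np.swapaxes(img, 0, 1)       # swap axes: columns become rows
sums = cols.sum(axis=1)             # sum of pixel values per column
y = (sums >= 1275).astype(int)      # 0 if the sum is < 1275, 1 otherwise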
I got to the point where I have two arrays (both of them are exactly what I use):
x = [94, 72, 2, 2, 1, 66, 1, 13, 1, 16, 1, 8, 1, 5, 1, 47, 1, 1, 1, 3, 1, 17, 14, 87, 100]
y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y is the thresholded binary array of the y-axis (0 if the pixel count is < 1275, 1 otherwise).
x is the run-length-encoded version of y, produced with itertools.groupby (the length of each run of consecutive equal values).
I also have the average distance of the letters, so I know which parts are wrongly segmented. (According to x, the average is 28.)
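For reference, a minimal sketch of how x and the average can be derived from the y array above (my reconstruction, not the original code):
from itertools import groupby

# Run-length encode y: x holds the length of each run of equal values
# (0-runs are gaps, 1-runs are letters)
x = [len(list(g)) for _, g in groupby(y)]

# y starts with a run of 0's, so the letter runs sit at the odd indices of x
avg = sum(x[1::2]) / len(x[1::2])   # ~28 for the arrays above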
And this is the image I would like to segment; it has 4 letters, "a", "l", "m", "a":
picture which I would like to segment
So in theory, if I could somehow merge the parts where the number of ones is lower than the average, and turn the "separating" zeros into ones, I should get a list that is as long as the width of the image and has zeros only where it should.
If I use cv.line on the y array, it does segment the characters, drawing a red line wherever the array is 0, but it oversegments them.
oversegmented image
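For context, a minimal sketch of that drawing step (my reconstruction; img_color is an assumed BGR copy of the image):
import cv2

# draw a red vertical line at every column where the binary profile is 0
for col, v in enumerate(y):
    if v == 0:
        cv2.line(img_color, (col, 0), (col, img_color.shape[0] - 1), (0, 0, 255), 1)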
What I would like to do is "modify", or just re-build, the y array based on x.
I have tried a lot of methods, but I just can't find the algorithm that goes over x, finds the wrong values, deletes the zeros in between, and modifies a list accordingly.
My best shot is this simple, nothing-like-my-original-idea piece:
num = 0
betterarray = []
for i in range(len(y) - 1):          # stop one short so y[i+1] stays in range
    # turn an isolated 0 that is directly followed by a 1 back into a 1
    if num == 1 and y[i] == 0 and y[i + 1] == 1:
        betterarray.append(1)
    else:
        betterarray.append(y[i])
    num = y[i]
betterarray.append(y[-1])            # keep the final element unchanged
It does delete the (most of the time) one-column-wide bad segmentations, but as I guessed, it deletes some good segmentations as well.
You should identify the wrongly segmented letters by comparing your segments to the average peak width, and modify the x array by combining any peaks that are smaller than the average.
def locate_oversegmentation(array, mask, avg):
    length = len(array)
    for i in range(length):
        # peak narrower than the average letter width
        if mask[i] == 1 and array[i] <= avg:
            if i - 2 >= 0:
                # previous peak is also below average: bridge the gap before it
                if array[i - 2] <= avg:
                    mask[i - 1] = 1
            if i + 2 < length:
                # next peak is also below average: bridge the gap after it
                if array[i + 2] <= avg:
                    mask[i + 1] = 1
    return mask
This function takes in array x and a compact version of array y, made by grouping consecutive 0's and 1's:
compact_y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
It returns a new array in which the 0's between below-average peaks are changed to 1's. The output array is a guide for combining peaks in array x.
Example:
x = [94, 72, 2, 2, 1, 66, 1, 13, 1, 16, 1, 8, 1, 5, 1, 47, 1, 1, 1, 3, 1, 17, 14, 87, 100]
compact_y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
avg = 28
guide = locate_oversegmentation(x, compact_y, avg)
>> guide = [0,1,0,1,0,1,0,1,1,1,1,1,1,1,0,1,0,1,1,1,0,1,0,1,0]
Apply the guide to array x by adding together the elements of x under each run of consecutive 1's.
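A minimal sketch of that merging step, plus the expansion back to the full-width binary array described in the question (the function name is mine, not from the answer):
def apply_guide(x, guide):
    # sum every run of consecutive 1's in the guide into a single peak of x
    merged = []
    i = 0
    while i < len(guide):
        if guide[i] == 1:
            total = 0
            while i < len(guide) and guide[i] == 1:
                total += x[i]
                i += 1
            merged.append(total)
        else:
            merged.append(x[i])
            i += 1
    return merged

merged = apply_guide(x, guide)
# -> [94, 72, 2, 2, 1, 66, 1, 45, 1, 47, 1, 5, 1, 17, 14, 87, 100]

# expand back to one value per image column (runs still alternate 0, 1, 0, ...)
new_y = []
for i, n in enumerate(merged):
    new_y += [i % 2] * n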
I am trying to classify a dataset and to find the best model for it, so I tested multiple models to evaluate them. After double-checking with accuracy and cross-validation, I applied them to the test dataset, and unusual results came up with the SVM models.
# Separating X and y
X = df_train.drop('*****', axis=1)
y = df_train.churn
x_test = df_test.drop('**', axis=1)
x_test.shape
# Scaling
import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=list(X.columns))
# RandomOverSampler for correcting imbalanced classes
from imblearn.over_sampling import RandomOverSampler

rus = RandomOverSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(X, y)
X_resampled.shape
(7304, 14)
# Separating X and y into train and test splits for metrics
from sklearn.model_selection import train_test_split, cross_val_score

X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=20)
X_train.shape
(5843, 14)
# Classifiers from sklearn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

names = ["Nearest Neighbors", "Linear SVM", "RBF SVM",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "QDA"]
# left out: GaussianProcessClassifier(1.0 * RBF(1.0)), "Gaussian Process"
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    DecisionTreeClassifier(max_depth=3),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]
# Applying all models and printing results
from sklearn.metrics import accuracy_score

for name, model in zip(names, classifiers):
    model.fit(X_train, y_train)
    print(name, ", Accuracy: ", accuracy_score(y_test, model.predict(X_test)))
Nearest Neighbors , Accuracy: 0.9240246406570842
Linear SVM , Accuracy: 0.7809719370294319
RBF SVM , Accuracy: 0.997946611909651
Decision Tree , Accuracy: 0.8446269678302533
Random Forest , Accuracy: 0.8042436687200547
Neural Net , Accuracy: 0.891170431211499
AdaBoost , Accuracy: 0.8473648186173853
Naive Bayes , Accuracy: 0.8220396988364134
QDA , Accuracy: 0.8576317590691307
#Best model from the comparison
best = SVC(gamma=2, C=1)
best.fit(X_resampled, y_resampled)
print(cross_val_score(best, X_resampled, y_resampled, cv=5))
best.predict(x_test)
[0.99657769 1. 1. 1. 1. ]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0], dtype=int64)
First unusual result: all zeros, even though the accuracy and the cross-validation scores seem fine.
#Best model from the comparison
best = SVC(kernel="linear", C=0.021)
best.fit(X_resampled, y_resampled)
print(cross_val_score(best, X_resampled, y_resampled, cv=5))
best.predict(x_test)
[0.77412731 0.77275838 0.77549624 0.77275838 0.76986301]
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1], dtype=int64)
Second unusual result: even though I get around 77% accuracy here, and even if the first model were the "perfect" one, we would have expected zeros. KNN produces similar results (all 1's), but other models produce more "logical" results, like:
#Best model from the comparison
best = DecisionTreeClassifier(max_depth=3)
best.fit(X_resampled, y_resampled)
print(cross_val_score(best, X_resampled, y_resampled, cv=5))
best.predict(x_test)
[0.8275154 0.83093771 0.82683094 0.8275154 0.83493151]
array([0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0,
0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1,
0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0,
0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0,
0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,
1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1,
1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0,
1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0,
0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0,
1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1,
1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0,
0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1,
1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0,
0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1,
0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1,
1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1,
1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1,
0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1,
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0,
0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1,
1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1,
1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1,
1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
1, 1], dtype=int64)
I really don't know what I am doing wrong; I would appreciate any help!
Thank you!
I am trying to roughly reproduce C-classification SVM cross-validation skill in R (the e1071 package) using Python (scikit-learn), but am getting nowhere near R's prediction skill. Given the training and test data below (which have been smoothed from much longer datasets), R's prediction skill is 0.87 (where 1 is perfect), while Python's is 0.55, which is not much better than guessing. Note that I am not at all trying to get identical results; I am just hopeful that if R can do reasonably well, then so can Python on the same dataset. I have split my data 50-50 (training and test), and am trying to predict binomial outcomes from floats. The R and Python code are given below. All the default SVM arguments that I checked were the same between R and Python (gamma, C (cost), shrinking, tol, etc.).
R code:
library("e1071")
data <- c(-108.604150711185, -131.880188127745, -18.3017441809734, 32.011639982337, -71.6651360870381, -107.587087751331, 21.316311739316, -36.015324564807, 138.22302265079, 47.9322592065447, -129.007749732555, -150.41808326425, -141.00589707504, -105.912063885407, 76.2956568174239, 141.457541434218, -20.6676395937811, -226.505644333494, -151.229861588686, -160.18717733968, -107.01667849677, -7.52794131287047, -93.1147621027003, 5.59630172385392, 38.741091785708, -32.9061390503546, -78.5031246062325, -9.64080356337477, -54.1430873201472, -108.127067430103, -12.2589074567133, 129.212940940854, 132.670728015743, 107.075153550768, 167.176831103164, -20.6839530330714, 102.677911281291, -109.423698849103, -154.454318421757, 140.52342226202, 110.184351332211, -16.6842057565239, -11.1688984829787, 178.441845032635, 37.0689292040101, 166.610506783818, -79.2764182099804, 99.1136693164655, 82.0929274697289, 15.1752041486536, 178.489001782771, 145.332200036106, -185.977800430997, -90.5440753976243, 78.0459300120412, 144.297553387967, 99.5945824957091, 110.803195137024, 81.3094331750562, -396.825240330405, -166.038928089807, -78.863983688682, 138.309908804212, -148.647304302406, -2.23135233624276, 129.411511929621, -111.664324254549, -96.4151180340831, 129.219227225386, 90.7050615157428, 141.986869866474, 93.0147970463941, 142.807435791073, -75.8426755946232, 122.537973092667, 117.078515092191, 134.166968023265, 90.8512172789568, 146.367129646428, 125.539182526718, -70.485058023267, -46.967575223949, 116.210349687502, -91.2992704167832, 104.052231138142, -114.580693287221, -82.9991067628608, -111.649187979413)
class <- as.factor(c(0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0))
df_training <- data.frame(data, class)
data <- c(133.75999742845, 22.9386702890105, -126.959902277009, -116.317935595297, -33.9418594804197, -49.0102540773413, -159.266630498512, -8.92296705690401, 114.328300224712, 66.0706175847251, -154.385344188283, 70.7868284982941, -28.334490887314, 118.755307949047, 154.362286178401, 101.331675190569, 96.2196681290104, 99.5694296232446, 210.160787371823, 65.8474210711036, -125.475676456606, 66.7541385125748, -161.001356357477, -40.1416817172267, 38.6877489907967, -7.12706914419719, -10.3967176519225, -80.6831091111636, 128.604227270616, 75.4219966516171, 184.951786958864, 90.9170782990185, 66.7190886024699, 81.377280661573, -82.4053965286415, -65.6718687269108, 61.1679518726262, 190.532649096311, 199.917670153196, 104.558442558929, 113.747065157369, 106.640501329133, 80.593201532054, 75.0176280888154, 155.538654396817, 30.0548798029353, 116.900219512636, 131.431417509576, 33.3308447581156, -121.191534016935, -80.4203785670198, 157.737407847885, 66.5956228628815, 50.8340706561446, -113.713450848071, -18.7787225270887, 113.832326071127, -45.5884280143408, 221.782395098832, 70.1660982367319, 235.005982636939, 80.8180320055801, -74.7107276814795, 133.925782624001, 97.9261686360971, -127.954532027281, 58.9295075974962, 96.1702797891484, -49.6048543914143, -42.1842037639683, -235.694708213157, 13.4862841916787, 126.396462591781, 214.297316240176, 125.148658464391, 84.8887673204376, 78.2717096234718, 139.677936314095, -168.649300541479, 103.40253638232, 69.2727189156141, 153.017155534869, -238.07168745534, -166.929968475244, 113.414489211719, 85.5520123243496, 120.582346886614, -214.850084749638, 96.8090523924549)
class <- as.factor(c(1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1))
df_test <- data.frame(data, class)
# train model
best.svm <- best.tune(svm,
                      class~data,
                      data=df_training, kernel='radial', cost=1, gamma=0.01,
                      type="C-classification")
#make predictions
TrainingPredictions<-predict(best.svm,df_training,type="class")
TestPredictions <- predict(best.svm,df_test,type="class")
Skill = sum(TestPredictions==df_test[[c('class')]])/length(TestPredictions)
print(Skill) #value is 0.87
Python code:
import numpy as np
from sklearn.svm import SVC
Data = np.array([-108.604150711185, -131.880188127745,-18.3017441809734, 32.011639982337, -71.6651360870381, -107.587087751331, 21.316311739316, -36.015324564807, 138.22302265079, 47.9322592065447, -129.007749732555, -150.41808326425, -141.00589707504, -105.912063885407, 76.2956568174239, 141.457541434218, -20.6676395937811, -226.505644333494, -151.229861588686, -160.18717733968, -107.01667849677, -7.52794131287047, -93.1147621027003, 5.59630172385392, 38.741091785708, -32.9061390503546, -78.5031246062325, -9.64080356337477, -54.1430873201472, -108.127067430103, -12.2589074567133, 129.212940940854, 132.670728015743, 107.075153550768, 167.176831103164, -20.6839530330714, 102.677911281291, -109.423698849103, -154.454318421757, 140.52342226202, 110.184351332211, -16.6842057565239, -11.1688984829787, 178.441845032635, 37.0689292040101, 166.610506783818, -79.2764182099804, 99.1136693164655, 82.0929274697289, 15.1752041486536, 178.489001782771, 145.332200036106, -185.977800430997, -90.5440753976243, 78.0459300120412, 144.297553387967, 99.5945824957091, 110.803195137024, 81.3094331750562,-396.825240330405, -166.038928089807, -78.863983688682, 138.309908804212, -148.647304302406, -2.23135233624276, 129.411511929621, -111.664324254549, -96.4151180340831, 129.219227225386, 90.7050615157428, 141.986869866474, 93.0147970463941, 142.807435791073, -75.8426755946232, 122.537973092667, 117.078515092191, 134.166968023265, 90.8512172789568, 146.367129646428, 125.539182526718, -70.485058023267, -46.967575223949, 116.210349687502, -91.2992704167832, 104.052231138142, -114.580693287221, -82.9991067628608, -111.649187979413])
Class = np.array([0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0])
df_training = np.array([Data, Class])
Data = np.array([133.75999742845, 22.9386702890105, -126.959902277009, -116.317935595297, -33.9418594804197, -49.0102540773413, -159.266630498512, -8.92296705690401, 114.328300224712, 66.0706175847251, -154.385344188283, 70.7868284982941, -28.334490887314, 118.755307949047, 154.362286178401, 101.331675190569, 96.2196681290104, 99.5694296232446, 210.160787371823, 65.8474210711036, -125.475676456606, 66.7541385125748, -161.001356357477, -40.1416817172267, 38.6877489907967, -7.12706914419719, -10.3967176519225, -80.6831091111636, 128.604227270616, 75.4219966516171, 184.951786958864, 90.9170782990185, 66.7190886024699, 81.377280661573, -82.4053965286415, -65.6718687269108, 61.1679518726262, 190.532649096311, 199.917670153196, 104.558442558929, 113.747065157369, 106.640501329133,80.593201532054, 75.0176280888154, 155.538654396817, 30.0548798029353, 116.900219512636, 131.431417509576, 33.3308447581156, -121.191534016935, -80.4203785670198, 157.737407847885, 66.5956228628815, 50.8340706561446, -113.713450848071, -18.7787225270887, 113.832326071127, -45.5884280143408, 221.782395098832, 70.1660982367319, 235.005982636939, 80.8180320055801, -74.7107276814795, 133.925782624001, 97.9261686360971, -127.954532027281, 58.9295075974962, 96.1702797891484, -49.6048543914143, -42.1842037639683, -235.694708213157, 13.4862841916787, 126.396462591781, 214.297316240176, 125.148658464391, 84.8887673204376, 78.2717096234718, 139.677936314095, -168.649300541479, 103.40253638232, 69.2727189156141, 153.017155534869, -238.07168745534, -166.929968475244, 113.414489211719,85.5520123243496, 120.582346886614, -214.850084749638, 96.8090523924549])
Class = np.array([1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1])
df_test = np.array([Data, Class])
# train model
clf = SVC(verbose=True, gamma=0.01, kernel='rbf', C=1)
clf.fit(df_training[0].reshape(88, 1), df_training[1])

# make predictions
TrainingPredictions = clf.predict(df_training[0].reshape(88, 1))
TestPredictions = clf.predict(df_test[0].reshape(89, 1))
Skill = np.sum(TestPredictions == df_test[1]) / float(len(TestPredictions))
print(Skill)  # value is 0.55
This observed difference might come from the fact that in R, svm() scales the data by default (see the documentation, page 6).
If you use scikit-learn's StandardScaler, you end up with a result pretty close to the one you obtained with R:
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
Data = np.array([-108.604150711185, -131.880188127745,-18.3017441809734, 32.011639982337, -71.6651360870381, -107.587087751331, 21.316311739316, -36.015324564807, 138.22302265079, 47.9322592065447, -129.007749732555, -150.41808326425, -141.00589707504, -105.912063885407, 76.2956568174239, 141.457541434218, -20.6676395937811, -226.505644333494, -151.229861588686, -160.18717733968, -107.01667849677, -7.52794131287047, -93.1147621027003, 5.59630172385392, 38.741091785708, -32.9061390503546, -78.5031246062325, -9.64080356337477, -54.1430873201472, -108.127067430103, -12.2589074567133, 129.212940940854, 132.670728015743, 107.075153550768, 167.176831103164, -20.6839530330714, 102.677911281291, -109.423698849103, -154.454318421757, 140.52342226202, 110.184351332211, -16.6842057565239, -11.1688984829787, 178.441845032635, 37.0689292040101, 166.610506783818, -79.2764182099804, 99.1136693164655, 82.0929274697289, 15.1752041486536, 178.489001782771, 145.332200036106, -185.977800430997, -90.5440753976243, 78.0459300120412, 144.297553387967, 99.5945824957091, 110.803195137024, 81.3094331750562,-396.825240330405, -166.038928089807, -78.863983688682, 138.309908804212, -148.647304302406, -2.23135233624276, 129.411511929621, -111.664324254549, -96.4151180340831, 129.219227225386, 90.7050615157428, 141.986869866474, 93.0147970463941, 142.807435791073, -75.8426755946232, 122.537973092667, 117.078515092191, 134.166968023265, 90.8512172789568, 146.367129646428, 125.539182526718, -70.485058023267, -46.967575223949, 116.210349687502, -91.2992704167832, 104.052231138142, -114.580693287221, -82.9991067628608, -111.649187979413])
Data = scaler.fit_transform(Data.reshape(-1, 1)).ravel()  # StandardScaler expects 2-D input
Class = np.array([0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0])
df_training = np.array([Data, Class])
Data = np.array([133.75999742845, 22.9386702890105, -126.959902277009, -116.317935595297, -33.9418594804197, -49.0102540773413, -159.266630498512, -8.92296705690401, 114.328300224712, 66.0706175847251, -154.385344188283, 70.7868284982941, -28.334490887314, 118.755307949047, 154.362286178401, 101.331675190569, 96.2196681290104, 99.5694296232446, 210.160787371823, 65.8474210711036, -125.475676456606, 66.7541385125748, -161.001356357477, -40.1416817172267, 38.6877489907967, -7.12706914419719, -10.3967176519225, -80.6831091111636, 128.604227270616, 75.4219966516171, 184.951786958864, 90.9170782990185, 66.7190886024699, 81.377280661573, -82.4053965286415, -65.6718687269108, 61.1679518726262, 190.532649096311, 199.917670153196, 104.558442558929, 113.747065157369, 106.640501329133,80.593201532054, 75.0176280888154, 155.538654396817, 30.0548798029353, 116.900219512636, 131.431417509576, 33.3308447581156, -121.191534016935, -80.4203785670198, 157.737407847885, 66.5956228628815, 50.8340706561446, -113.713450848071, -18.7787225270887, 113.832326071127, -45.5884280143408, 221.782395098832, 70.1660982367319, 235.005982636939, 80.8180320055801, -74.7107276814795, 133.925782624001, 97.9261686360971, -127.954532027281, 58.9295075974962, 96.1702797891484, -49.6048543914143, -42.1842037639683, -235.694708213157, 13.4862841916787, 126.396462591781, 214.297316240176, 125.148658464391, 84.8887673204376, 78.2717096234718, 139.677936314095, -168.649300541479, 103.40253638232, 69.2727189156141, 153.017155534869, -238.07168745534, -166.929968475244, 113.414489211719,85.5520123243496, 120.582346886614, -214.850084749638, 96.8090523924549])
Data = scaler.fit_transform(Data.reshape(-1, 1)).ravel()  # StandardScaler expects 2-D input
Class = np.array([1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1])
df_test = np.array([Data, Class])
# train model
clf = SVC(verbose=True, gamma=0.01, kernel='rbf', C=1)
clf.fit(df_training[0].reshape(88, 1), df_training[1])

# make predictions
TrainingPredictions = clf.predict(df_training[0].reshape(88, 1))
TestPredictions = clf.predict(df_test[0].reshape(89, 1))
Skill = np.sum(TestPredictions == df_test[1]) / float(len(TestPredictions))
print("Skill: " + str(Skill))  # value is 0.84