So I have a dataset of 700 texts annotated with difficulty levels. Each text has 150 features:
feature_names = ['F1', 'F2', 'F3', ...]       shape (1, 150)
features_x = ['0.1', '0.765', '0.543', ...]   shape (700, 150)
correct_answers_y = ['1', '2', '4', ...]      shape (1, 700)
I want to use PCA to find the most informative sets of features, something like:
Component1 = 0.76*F1 + 0.11*F4 - 0.22*F7
How can I do that? The code from the sklearn user guide has some numbers as output, but I don't understand how to interpret them.
fit_xy = pca.fit(features_x,correct_answers_y)
array([ 4.01783322e-01, 1.98421989e-01, 3.08468655e-01,
        4.28813755e-02, ...])
Not sure where that array comes from, but it looks like the output of the explained_variance_ or explained_variance_ratio_ attribute. They are what their names say: the explained variance, and the explained variance ratio relative to the total variance of your data. Usually when doing PCA you define a minimum ratio of variance you want to keep from the data.
Let's say you want to keep at least 90% of the variance in your data. Here's code to find how many principal components (the n_components parameter of PCA) you need:
pca_cumsum = pca.explained_variance_ratio_.cumsum()
pca_cumsum
>> np.array([.54, .79, .89, .91, .97, .99, 1])
np.argmax(pca_cumsum >= 0.9)
>> 3
Note that np.argmax returns the zero-based index of the first component at which the cumulative ratio reaches 0.9, so here you would keep 3 + 1 = 4 components.
And as desertnaut said, the labels will be ignored; they are not used in PCA.
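To get the kind of breakdown you describe (Component1 = 0.76*F1 + 0.11*F4 - 0.22*F7), look at pca.components_ after fitting: each row holds the loadings of one component on your original features. A minimal sketch, assuming features_x is a numeric (700, 150) array and feature_names is a flat list of the 150 feature names:
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=10)   # keep however many components you decided on above
pca.fit(features_x)          # y is ignored, as noted

# Loadings of the first principal component on each original feature
first_component = pca.components_[0]                 # shape (150,)
top = np.argsort(np.abs(first_component))[::-1][:5]  # 5 largest loadings by magnitude
for idx in top:
    print(feature_names[idx], round(first_component[idx], 3))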
I've seen that it's common practice to delete input features that demonstrate collinearity (and keep only one of them).
However, I've just completed a course on how a linear regression model gives different weights to different features, and I thought that maybe the model would do better than us by assigning a low weight to the less useful features instead of deleting them completely.
To try to resolve this doubt myself, I created a small dataset resembling an x_squared function and fit two linear regression models in Python:
A model that keeps only the x_squared feature
A model that keeps both the x and x_squared features
The results suggest that we shouldn't delete features and should instead let the model decide the best weights. However, I would like to ask the community whether the rationale of my exercise is right, and whether you've come across this question elsewhere.
Here's my code to generate the dataset:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Generate the data
all_Y = [10, 3, 1.5, 0.5, 1, 5, 8]
all_X = range(-3, 4)
all_X_2 = np.square(all_X)
# Store the data into a dictionary
data_dic = {"x": all_X, "x_2": all_X_2, "y": all_Y}
# Generate a dataframe
df = pd.DataFrame(data=data_dic)
# Display the dataframe
display(df)
which produces this:
and this is the code to generate the ML models:
# Create the lists to iterate over
ids = [1, 2]
features = [["x_2"], ["x", "x_2"]]
titles = ["$x^{2}$", "$x$ and $x^{2}$"]
colors = ["blue", "green"]
# Initiate figure
fig = plt.figure(figsize=(15,5))
# Iterate over the necessary lists to plot results
for i, model, title, color in zip(ids, features, titles, colors):
    # Initiate model, fit and make predictions
    lr = LinearRegression()
    lr.fit(df[model], df["y"])
    predicted = lr.predict(df[model])
    # Calculate mean squared error of the model
    mse = mean_squared_error(all_Y, predicted)
    # Create a subplot for each model
    plt.subplot(1, 2, i)
    plt.plot(df["x"], predicted, c=color, label="f(" + title + ")")
    plt.scatter(df["x"], df["y"], c="red", label="y")
    plt.title("Linear regression using " + title + " --- MSE: " + str(round(mse, 3)))
    plt.legend()
# Display results
plt.show()
which generates this:
What do you think about this issue? This difference in the mean squared error can be highly important in certain contexts.
Because x and x^2 are not linearly related, deleting one of them does not help the model here. The general advice for regression is to delete features that are highly collinear with each other (which also means highly correlated).
So x_2 and y are highly correlated and you are trying to predict y with x_2? A high correlation between a predictor variable and the response variable is usually a good thing, and since x and y are practically uncorrelated, adding x is likely to "dilute" your model and thereby worsen its performance.
(Multi-)collinearity between the predictor variables themselves would be more problematic.
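Just to make this concrete with the toy data from the question (not part of the original answers), you can inspect the pairwise correlations directly:
# Pearson correlations between the predictors and the target.
# With this symmetric data, x is uncorrelated with x_2 (and nearly uncorrelated
# with y), while x_2 and y are strongly correlated.
print(df[["x", "x_2", "y"]].corr().round(3))
For a more formal collinearity check between predictors you would typically look at variance inflation factors, but with only two predictors the correlation matrix already tells the story.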
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectFwe, f_regression
from sklearn import __version__  # 1.0.2
I have this example dataset:
X = np.array([
[3,0,2,1],
[7,3,0,5],
[4,2,5,1],
[6,2,7,3],
[3,2,5,2],
[6,1,1,4]
])
y = np.array([2,9,2,4,5,9])
The f_regression(X, y) function returns two arrays:
(array([ 4.68362124, 0.69456469, 2.59714175, 27.64721141]),
array([0.09643779, 0.45148854, 0.18234859, 0.00626275]))
The first one contains the F-statistics for the 4 features of my dataset, the second one contains the p-values associated with those F-statistics.
Now suppose I want to extract the features with a p-value lower than 0.15; what I expect is that the first and last features are selected. I would like to use SelectFwe (here is the documentation) to perform this step, so:
SelectFwe(f_regression, alpha=.15).fit(X, y).get_support()
Unfortunately it returns array([False, False, False, True]), meaning that only the last feature is selected.
Why does this happen? Did I misunderstand how SelectFwe works? Perhaps the following plot is helpful:
The code I used to produce the plot:
plt.plot(
    np.linspace(0, 1, 101),
    [SelectFwe(f_regression, alpha=alpha).fit(X, y).get_support().sum() for alpha in np.linspace(0, 1, 101)]
)
plt.xlabel("alpha")
plt.ylabel("selected features")
plt.show()
In the source, the alpha is divided by the number of features:
def _get_support_mask(self):
    check_is_fitted(self)
    return self.pvalues_ < self.alpha / len(self.pvalues_)
This is because the class controls the "family-wise error" rate, which is
the probability of making one or more false discoveries
(Wikipedia, my emphasis). You can instead use SelectFpr, the "false positive rate" test, which works exactly the same way but does not divide alpha by the number of features. See also Issue1007.
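To confirm this on your example (a minimal check, reusing the X and y defined above):
from sklearn.feature_selection import SelectFpr, SelectFwe, f_regression

# SelectFpr compares each p-value to alpha directly, so the features with
# p = 0.096 and p = 0.006 both pass the 0.15 threshold
print(SelectFpr(f_regression, alpha=0.15).fit(X, y).get_support())
# [ True False False  True]

# SelectFwe effectively uses alpha / n_features = 0.15 / 4 = 0.0375,
# which only the last feature (p = 0.006) survives
print(SelectFwe(f_regression, alpha=0.15).fit(X, y).get_support())
# [False False False  True]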
I'm trying to replicate an experiment from a paper using SVM, to build up my machine learning knowledge. In this paper, the author extracts the features and chooses the feature sizes. He then shows a table where F represents the size of the feature vector and N represents the face images.
He then works with the F >= 9 and N >= 15 parameters.
Now, what I want to do is actually grab the features I extract, as he does in the paper.
Basically, this is how I extract the features:
def load_image_files(fullpath, dimension=(64, 64)):
    descr = "A image classification dataset"
    images = []
    flat_data = []
    target = []
    dimension = (64, 64)
    for category in CATEGORIES:
        path = os.path.join(DATADIR, category)
        for person in os.listdir(path):
            personfolder = os.path.join(path, person)
            for imgname in os.listdir(personfolder):
                class_num = CATEGORIES.index(category)
                fullpath = os.path.join(personfolder, imgname)
                img_resized = resize(skimage.io.imread(fullpath), dimension,
                                     anti_aliasing=True, mode='reflect')
                flat_data.append(img_resized.flatten())
                images.append(skimage.io.imread(fullpath))
                target.append(class_num)
    flat_data = np.array(flat_data)
    target = np.array(target)
    images = np.array(images)
    print(CATEGORIES)
    return Bunch(data=flat_data,
                 target=target,
                 target_names=category,
                 images=images,
                 DESCR=descr)
How do I select the number of features extracted and stored? Or how do I manually store a vector with the number of features that I need, for instance a feature vector of size 9?
I'm trying to separate my features this way:
X_train, X_test, y_train, y_test = train_test_split(
image_dataset.data, image_dataset.target, test_size=0.3,random_state=109)
model = ExtraTreesClassifier(n_estimators=10)
model.fit(X_train, y_train)
print(model.feature_importances_)
Though, my output is:
[0. 0. 0. ... 0. 0. 0.]
For SVM classification, I'm trying to use OneVsRestClassifier:
model_to_set = OneVsRestClassifier(SVC(kernel="poly"))
parameters = {
"estimator__C": [1,2,4,8],
"estimator__kernel": ["poly", "rbf"],
"estimator__degree":[1, 2, 3, 4],
}
model_tunning = GridSearchCV(model_to_set, param_grid=parameters)
model_tunning
model_tunning.fit(X_train, y_train)
prediction = model_tunning.best_estimator_.predict(X_test)
Then, once I call prediction, I get:
Out[29]:
array([1, 0, 4, 2, 1, 3, 3, 0, 1, 1, 3, 4, 1, 1, 0, 3, 2, 2, 2, 0, 4, 2,
2, 4])
So you've got two arrays of image information (one unprocessed, the other resized and flattened) as well as a list of corresponding class values (which we usually call labels). There are currently 2 things not quite right with the setup, however:
1) What's missing here are multiple features - these might include specific arrays from data associated with feature extraction from morphological/computer vision processes of your images, or they may be ancillary data like a list of preferences, behaviors, purchases. Basically, anything that can act as an array in either a numerical or categorical format. Technically speaking, your resized images are a second feature, but I don't think this will add much if any improvement in model performance.
2) target_names=category in your function's return will store only the last iteration of category in CATEGORIES. I don't know if this is what you want.
Going back to your table, N would refer to the number of images in the dataset, and F would be the number of corresponding feature arrays associated with each image. By way of example, let's say we have fifty individual wines and five features (colour, taste, alcohol content, pH, optical density). N of 5 would be five of those wines, and F of 2 would be, say, colour and taste.
If I had to guess at what your features would be, they would in fact be a single feature - the image data itself. Looking at your data structure, every label/category you have will have multiple individuals (people) each with multiple examples of images of that person. Note that multiple individuals are not separate features - the way you're structuring the data, the individuals are grouped together under a single category.
So, where to from here? Without knowing what paper you're reading it's hard to suggest what to do, but I would go back and see if you can perhaps provide us with more information about the problem.
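In the meantime, if all you need is a fixed-length feature vector of size F (say F = 9) from the flattened image data, one possible sketch (not necessarily what the paper does) is to project the pixels down with PCA after splitting, assuming you have at least 9 training images:
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    image_dataset.data, image_dataset.target, test_size=0.3, random_state=109)

# Project the flattened pixels down to F = 9 components,
# fitting on the training set only to avoid leaking test information
F = 9
pca = PCA(n_components=F)
X_train_f = pca.fit_transform(X_train)
X_test_f = pca.transform(X_test)
print(X_train_f.shape)   # (n_train_images, 9)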
I am looking to extract and identify digits from an image.
I've read a lot about digit recognition but did not find anything on adding rules to select only the digits we are interested in.
The rules would be "quite simple": I want to extract only the digits surrounded with a blue pen, for example.
I'm not expecting the entire solution here, just some research directions or links to similar problems.
I am quite familiar with neural networks and intend to use one for this. But I cannot see how to filter out only the surrounded digits.
Here is a sample of the picture. Imagine the same layout repeated several times in one picture.
I think you have three ways of approaching this, and maybe you do not need to get that far! For now, we will only look for which digit has been selected.
Case 1: You can try to use the Hough transform for circles to find the circles present in the image.
% Solution 1 (practically a perfect circle, use the Hough circle transform to find circles)
im = imread('https://i.stack.imgur.com/L7cE1.png');
[centers, radii, metric] = imfindcircles(im, [10, 60]);
imshow(im); viscircles(centers, radii,'EdgeColor','r');
Case 2: You can work in the blue colour space and discard achromatic colours to segment the areas that interest you (if you add some margin this works well).
% Solution 2 (ALWAYS is blue, read only rgB channel and delete achromatic)
b = im(:, :, 3) & (std(double(im(:, :, :)), [], 3) > 5);
bw = imfill(b,'holes');
stats = regionprops('table', bw, 'Centroid', 'MajorAxisLength','MinorAxisLength')
imshow(im); viscircles(stats.Centroid, stats.MajorAxisLength / 2,'EdgeColor','r');
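If you prefer Python, a rough OpenCV equivalent of Case 2 is sketched below; the file names and HSV thresholds are placeholders that would need tuning for the actual pen colour, and it assumes OpenCV 4:
import cv2

img = cv2.imread("digits.png")

# Threshold on a blue hue range in HSV (bounds are approximate)
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv, (90, 60, 60), (140, 255, 255))

# Find the blue connected regions and draw their enclosing circles
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    (x, y), r = cv2.minEnclosingCircle(c)
    if r > 10:  # ignore tiny specks
        cv2.circle(img, (int(x), int(y)), int(r), (0, 0, 255), 2)

cv2.imwrite("detected.png", img)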
Case 3: You can generate a dataset with both positive and negative cases, and train a neural network with 10 outputs, each one indicating (with a sigmoid output) whether the corresponding digit is circled or not. The good thing about this type of model is that you do not need to run an OCR afterwards.
import keras
from keras.layers import *
from keras.models import Model
from keras.losses import mean_squared_error
from keras.applications.mobilenet import MobileNet

def model():
    WIDTH, HEIGHT = 128, 128
    mobile_input = Input(shape=(WIDTH, HEIGHT, 3))
    alpha = 0.25  # 0.25, 0.5, 1
    shape = (1, 1, int(1024 * alpha))
    dropout = 0.1
    input_ = Input(shape=(WIDTH, HEIGHT, 3))
    mobile_model = MobileNet(input_shape=(WIDTH, HEIGHT, 3),
                             alpha=alpha,
                             include_top=False,
                             dropout=dropout,
                             pooling='avg')
    base_model = mobile_model(mobile_input)
    x = Reshape(shape, name='reshape_1')(base_model)
    x_gen = Dropout(dropout, name='dropout')(x)
    x = Conv2D(10, (1, 1), padding='same')(x_gen)
    x = Activation('sigmoid')(x)
    output_detection = Reshape((10,), name='output_mark_detection')(x)
    """x = Conv2D(2 * 10, (1, 1), padding='same')(x_gen)
    x = Activation('sigmoid')(x)
    output_position = Reshape((2 * 10, ), name='output_mark_position')(x)
    output = Concatenate(axis=-1)([output_detection, output_position])
    """
    model = Model(name="mark_net", inputs=mobile_input, outputs=output_detection)
    return model
It depends on your problem; the first two cases may already be enough. If you have varying lighting conditions, rotation, scaling, etc., I advise you to go directly to neural networks; you can create many "artificial" examples:
You can generate an artificial dataset by adding distorted circles (take a normal circle, apply random affine transformations, add noise, vary the blue colour and the line a little, etc.).
Then you paste the circles randomly over the digits and generate the dataset indicating which digits are marked.
Once they are "stuck on the paper", you can apply data augmentation again to make it look more real.
You can break the problem into two simpler sub-problems: first train a neural network to recognize the circles and isolate them; once you have done that, train a second neural network to recognize the digits within the isolated subsections. Hope this helps.
I have the movielens dataset and I want to apply PCA to it, but the sklearn PCA function does not seem to do it correctly.
I have a 718*8913 matrix whose rows are the users and whose columns are the movies.
Here is my Python code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load movie names and movie ratings
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
ratings.drop(['timestamp'], axis=1, inplace=True)
def replace_name(x):
    return movies[movies['movieId'] == x].title.values[0]
ratings.movieId = ratings.movieId.map(replace_name)
M = ratings.pivot_table(index=['userId'], columns=['movieId'], values='rating')
df1 = M.replace(np.nan, 0, regex=True)
# Standardizing
X_std = StandardScaler().fit_transform(df1)

# Apply PCA
pca = PCA()
result = pca.fit_transform(X_std)
print(result.shape)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()
I didn't set any number of components, so I expected PCA to return a 718*8913 matrix in the new dimensions, but the PCA result has size 718*718, pca.explained_variance_ratio_ has size 718, and the sum of all its elements is 1. How is this possible?
I have 8913 features, yet it returns only 718, and the sum of their explained variance ratios is equal to 1. Can anyone explain what is wrong here?
My plot result:
As you can see in the picture above, it contains only 718 components and their ratios sum to 1, but I have 8913 features; where did they go?
Test with a smaller example
I even tried the scikit-learn PCA example that can be found in the PCA documentation page (here is the link). I changed the example and just increased the number of features:
import numpy as np
from sklearn.decomposition import PCA
import pandas as pd
X = np.array([[-1, -1,3,4,-1, -1,3,4], [-2, -1,5,-1, -1,3,4,2], [-3, -2,1,-1, -1,3,4,1],
[1, 1,4,-1, -1,3,4,2], [2, 1,0,-1, -1,3,4,2], [3, 2,10,-1, -1,3,4,10]])
ipca = PCA(n_components=7)
print(X.shape)
ipca.fit(X)
result = ipca.transform(X)
print(result.shape)
In this example we have 6 samples and 8 features. I set n_components to 7, but the result size is 6*6.
I think that when the number of features is bigger than the number of samples, the maximum number of components scikit-learn's PCA will return is equal to the number of samples.
See the documentation on PCA.
Because you did not pass an n_components parameter to PCA(), sklearn uses min(n_samples, n_features) as the value of n_components, which is why you get a reduced feature set equal to n_samples.
I believe the explained variances sum to 1 because you didn't set n_components; from the documentation:
If n_components is not set then all components are stored and the sum
of explained variances is equal to 1.0.
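A quick way to convince yourself of this behaviour (a small sketch with random data, not your movielens matrix):
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(6, 8913)        # more features than samples, like your case

pca = PCA()                  # n_components not set
pca.fit(X)

print(pca.n_components_)                     # 6, i.e. min(n_samples, n_features)
print(pca.explained_variance_ratio_.sum())   # ~1.0, all variance is retained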