PCA analysis considering the N least relevant components - Python

I am trying to learn the basics of PCA in Python using the scikit-learn libraries (in particular sklearn.decomposition and sklearn.preprocessing). The goal is to import data from images into a matrix X (each row is a sample, each column is a feature), standardize X, use PCA to extract principal components (the 2 most important, the 6 most important, ..., the 6 least important), project X onto these principal components, reverse the transformation, and plot the result in order to see the difference with respect to the original image(s).
Now let's say that I do not want to consider the 2, 3, 4... most important principal components, but instead the N least relevant components, say N = 6.
How should the analysis be done?
I mean, I can't simply standardize, then call PCA().fit_transform() and then revert back with inverse_transform() to plot the results.
At the moment I am doing something like this:
X_std = StandardScaler().fit_transform(X)  # standardize the original data
pca = PCA()
model = pca.fit(X_std)                     # fit a model with all components
Xprime = model.components_[-6:, :]         # the last 6 principal components
And then I stop, because I know I should call transform(), but I do not understand how to do it. I have tried several times without being successful.
Can someone tell me whether the previous steps are correct and point out the direction to follow?
Thank you very much
EDIT: I have currently adopted this solution, as suggested by the first answer to my question:
import copy

model = PCA().fit(X_std)
model2pc = copy.deepcopy(model)      # copy, so the full model stays intact
model2pc.components_[2:, :] = 0      # keep only the first 2 principal components
Xp_2pc = model2pc.transform(X_std)
Xr_2pc = model2pc.inverse_transform(Xp_2pc)
And then I do the same for 6 PCs, 60 PCs and the last 6 PCs. What I have noticed is that this is very time consuming. I would like to get a model that directly extracts the principal components I need (without zeroing out the others) and then call transform() and inverse_transform() with that model.

If you want to ignore all but the last 6 principal components, you can just zero out the ones you don't want to keep.
N = 6
X_std = StandardScaler().fit_transform(X)
pca = PCA()
model = pca.fit(X_std) # create model with all components
model.components_[:-N] = 0
Then, to remove all but the last N components from the data, just do a forward and inverse transform of the data:
Xprime = model.inverse_transform(model.transform(X_std))
Here is an example:
>>> X = np.random.rand(18).reshape(6, 3)
>>> model = PCA().fit(X)
A round-trip transform should give back the original data:
>>> X
array([[0.16594796, 0.02366958, 0.8403745 ],
       [0.25219425, 0.22879029, 0.07950927],
       [0.69636084, 0.4410933 , 0.97431828],
       [0.50121079, 0.44835563, 0.95236146],
       [0.6793044 , 0.53847562, 0.27882302],
       [0.32886931, 0.0643043 , 0.10597973]])
>>> model.inverse_transform(model.transform(X))
array([[0.16594796, 0.02366958, 0.8403745 ],
       [0.25219425, 0.22879029, 0.07950927],
       [0.69636084, 0.4410933 , 0.97431828],
       [0.50121079, 0.44835563, 0.95236146],
       [0.6793044 , 0.53847562, 0.27882302],
       [0.32886931, 0.0643043 , 0.10597973]])
Now zero out the first principal component:
>>> model.components_
array([[ 0.22969899,  0.21209762,  0.94986998],
       [-0.67830467, -0.66500728,  0.31251894],
       [ 0.69795497, -0.71608653, -0.0088847 ]])
>>> model.components_[:-2] = 0
>>> model.components_
array([[ 0.        ,  0.        ,  0.        ],
       [-0.67830467, -0.66500728,  0.31251894],
       [ 0.69795497, -0.71608653, -0.0088847 ]])
The round-trip transform now gives a different result since we've removed the first principal component (which contains the greatest amount of variance):
>>> model.inverse_transform(model.transform(X))
array([[ 0.12742811, -0.01189858,  0.68108405],
       [ 0.36513945,  0.33308073,  0.54656949],
       [ 0.58029482,  0.33392119,  0.49435263],
       [ 0.39987803,  0.35478779,  0.53332196],
       [ 0.71114004,  0.56787176,  0.41047233],
       [ 0.44000711,  0.16692583,  0.56556581]])
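If zeroing out rows of components_ and redoing the transform feels too slow (as mentioned in the question's edit), you can also project onto the last N components directly. The rows of components_ are orthonormal, so the forward and inverse maps are plain matrix products. A minimal sketch, assuming the default whiten=False and that X is your data matrix:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

N = 6
X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

W = pca.components_[-N:]            # shape (N, n_features), orthonormal rows
scores = (X_std - pca.mean_) @ W.T  # projection onto the last N PCs
X_rec = scores @ W + pca.mean_      # back to the original feature space
X_rec should match what the zeroing approach produces, but without mutating the fitted model, so the same pca object can be reused for 2, 6, 60, ... components.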


Numpy applying a time interval sequence to a multidimensional ndarray (such as coordinates)

EDIT: added a prefix/suffix value to the interval arrays to make them the same length as their corresponding data arrays, as per #user1319128's suggestion, and indeed interp does the job. His solution was workable and good; I just couldn't see it because I was tired.
I am sure this is a fairly mundane application, but I have failed to find or come up with a way to do this without going outside of numpy. Maybe my brain just needs a rest; anyway, here is the problem, with an example and the solution requirements.
I have two arrays of different lengths and I want to apply a common set of time intervals to them, so that the result is versions of these arrays that are all the same length and whose values relate to each other at the same row (if that makes sense). In the example below I have named this functionality "apply_timeintervals_to_array". The example code:
import numpy as np
from colorsys import hsv_to_rgb

num_xy = 20
num_colors = 12

#xy = np.random.rand(num_xy, 2) * 1080
xy = np.array([[ 687.32758344,  956.05651214],
               [ 226.97671414,  698.48071588],
               [ 648.59878864,  175.4882185 ],
               [ 859.56600997,  487.25205922],
               [ 794.43015178,   16.46114312],
               [ 884.7166732 ,  634.59100322],
               [ 878.94218682,  835.12886098],
               [ 965.47135726,  542.09202328],
               [ 114.61867445,  601.74092126],
               [ 134.02663822,  334.27221884],
               [ 940.6589034 ,  245.43354493],
               [ 285.87902276,  550.32600784],
               [ 785.00104142,  993.19960822],
               [1040.49576307,  486.24009511],
               [ 165.59409198,  156.79786175],
               [1043.54280058,  313.09073855],
               [ 645.62878826,  100.81909068],
               [ 625.78003257,  252.17917611],
               [1056.77009875,  793.02218098],
               [   2.93152052,  596.9795026 ]])

xy_deltas = np.sum((xy[1:] - xy[:-1])**2, axis=-1)
xy_ti = np.concatenate(([0.0],
                        (xy_deltas) / np.sum(xy_deltas)))
colors_ti = np.concatenate((np.linspace(0, 1, num_colors),
                            [1.0]))
common_ti = np.unique(np.sort(np.concatenate((xy_ti,
                                              colors_ti))))
common_colors = (np.array(tuple(hsv_to_rgb(t, 0.9, 0.9) for t
                                in np.concatenate(([0.0],
                                                   common_ti,
                                                   [1.0]))))
                 * 255).astype(int)[1:-1]
common_xy = apply_timeintervals_to_array(common_ti, xy)
So one could then use the common arrays for additional computations or for rendering.
The question is what could accomplish the "apply_timeintervals_to_array" functionality, or alternatively a better way to generate the same data.
I hope this is clear enough, let me know if it isn't. Thank you in advance.
I think numpy.interp should meet your expectations. For example, if I have a 2D array of length 20 and would like to interpolate at different common_ti values, whose length is 30, the code would be as follows:
import numpy as np

xy = np.arange(0, 400, 10).reshape(20, 2)
xy_ti = np.arange(20) / 19
common_ti = np.linspace(0, 1, 30)
x = np.interp(common_ti, xy_ti, xy[:, 0])  # interpolate the first column
y = np.interp(common_ti, xy_ti, xy[:, 1])  # interpolate the second column
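Wrapping that column-wise call in a helper gives something close to the apply_timeintervals_to_array used in the question. A sketch, assuming the per-array time values are passed in as a third argument and are monotonically increasing:
import numpy as np

def apply_timeintervals_to_array(common_ti, arr, arr_ti):
    # interpolate every column of arr from its own time values onto common_ti
    return np.column_stack([np.interp(common_ti, arr_ti, arr[:, col])
                            for col in range(arr.shape[1])])

common_xy = apply_timeintervals_to_array(common_ti, xy, xy_ti)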

How to use sklearn's Matrix factorization to predict new users' recommendation scores

I'm trying to apply sklearn.decomposition.NMF to a matrix R that contains data on how users rated items, in order to predict user ratings for items that they have not yet seen.
The matrix's rows are users, its columns are items, and its values are scores, with a score of 0 meaning that the user has not rated this item yet.
With the code below I have only managed to get the two matrices that, when multiplied together, give the original matrix back.
import numpy

R = numpy.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
])

from sklearn.decomposition import NMF
model = NMF(n_components=4)
A = model.fit_transform(R)
B = model.components_
n = numpy.dot(A, B)
print(n)
The problem is that the model does not predict new values in place of the 0's (which would be the predicted scores) but instead recreates the matrix as it was.
How do I get the model to predict user scores in place of my original matrix's zeros?
That is what is supposed to happen.
However, in most cases you are not going to have a number of components so close to the number of products and/or customers.
So, for instance, consider 2 components:
import numpy as np

model = NMF(n_components=2)
A = model.fit_transform(R)
B = model.components_
R_estimated = np.dot(A, B)
print(np.sum(R - R_estimated))
-1.678873127048393

R_estimated
array([[5.2558264 , 1.99313836, 0.        , 1.45512772],
       [3.50429478, 1.32891458, 0.        , 0.9701988 ],
       [1.31294288, 0.94415991, 1.94956896, 3.94609389],
       [0.98129195, 0.72179987, 1.52759811, 3.0788454 ],
       [0.        , 0.65008935, 2.84003662, 5.21894555]])
You can see that in this case many of the previous zeros are now other numbers you could use. For a bit of context, see https://en.wikipedia.org/wiki/Matrix_factorization_(recommender_systems).
How to select n_components?
I think the question above is answered, but the complete procedure could be something like the one below.
For that we need to know which values in R are real ratings and which ones we actually want to predict; in many cases the 0's in R are those new cases/scenarios.
It is common to replace the 0's in R with the averages for products or customers and then compute the decomposition for several candidate values of n_components, selecting the best one with a metric evaluated on a test sample (only the real ratings in R should count):
1) Create R_with_Averages.
2) Model selection:
   2.1) Split R_with_Averages into training and test sets.
   2.2) Compare different n_components (from 1 up to some arbitrary number) using a metric in which you only consider the real ratings in R (see the sketch after this list).
   2.3) Select the best model --> best n_components.
3) Predict with the best model.
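A minimal sketch of the masked-error metric from step 2.2 (a full version would also hold out part of the known ratings as the test set from step 2.1; the helper and variable names here are illustrative, not from the original post):
import numpy as np
from sklearn.decomposition import NMF

def masked_rmse(R_true, R_est, known):
    # error computed only over the entries that are real ratings
    diff = (R_true - R_est)[known]
    return np.sqrt(np.mean(diff ** 2))

known = R > 0
for k in range(1, 5):
    model = NMF(n_components=k, init='random', random_state=0, max_iter=500)
    R_est = np.dot(model.fit_transform(R), model.components_)
    print(k, masked_rmse(R, R_est, known))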
Perhaps good to see:
Sarwar, B. M., Karypis, G., Konstan, J. A., and Riedl, J. (2000). Application of Dimensionality Reduction in Recommender System - A Case Study. In ACM WebKDD'00 (Web Mining for E-Commerce Workshop). This gives you an overall view.
http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/ is an example with very similar code.
sklearn's implementation of NMF does not seem to support missing values (NaNs; here, 0 values basically represent unknown ratings corresponding to new users), refer to this issue. However, we can use Surprise's NMF implementation, as shown in the following code:
import numpy as np
import pandas as pd
from surprise import NMF, Dataset, Reader

R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)
R[R == 0] = np.nan
print(R)
# [[ 5.  3. nan  1.]
#  [ 4. nan nan  1.]
#  [ 1.  1. nan  5.]
#  [ 1. nan nan  4.]
#  [nan  1.  5.  4.]]

df = pd.DataFrame(data=R, index=range(R.shape[0]), columns=range(R.shape[1]))
df = pd.melt(df.reset_index(), id_vars='index', var_name='items', value_name='ratings').dropna(axis=0)

reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(df[['index', 'items', 'ratings']], reader)

k = 2
algo = NMF(n_factors=k)
trainset = data.build_full_trainset()
algo.fit(trainset)

predictions = algo.test(trainset.build_testset())       # predict the known ratings
R_hat = np.zeros_like(R)
for uid, iid, true_r, est, _ in predictions:
    R_hat[uid, iid] = est

predictions = algo.test(trainset.build_anti_testset())  # predict the unknown ratings
for uid, iid, true_r, est, _ in predictions:
    R_hat[uid, iid] = est

print(R_hat)
# [[4.40762528 2.62138084 3.48176319 0.91649316]
#  [3.52973408 2.10913555 2.95701406 0.89922637]
#  [0.94977826 0.81254138 4.98449755 4.34497549]
#  [0.89442186 0.73041578 4.09958967 3.50951819]
#  [1.33811051 0.99007556 4.37795636 3.53113236]]
The NMF implementation used here follows the [NMF:2014] paper, as described here.
Note that the optimization is performed using the known ratings only, so the predicted values for the known ratings come out close to the true ratings (while the predicted values for the unknown ratings are not, in general, close to 0), as expected.
Again, as usual, we can find the number of factors k using cross-validation.
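Surprise ships its own cross-validation utilities, so selecting k could look roughly like the following sketch (with only a handful of ratings in this toy R, the numbers are merely illustrative):
from surprise import NMF
from surprise.model_selection import cross_validate

for k in (1, 2, 3):
    res = cross_validate(NMF(n_factors=k), data, measures=['RMSE'], cv=3, verbose=False)
    print(k, res['test_rmse'].mean())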

Randomly sample from multiple tf.data.Datasets in Tensorflow

Suppose I have N tf.data.Datasets and a list of N probabilities (summing to 1). Now I would like to create a dataset such that the examples are sampled from the N datasets with the given probabilities.
I would like this to work for arbitrary probabilities -> a simple zip/concat/flatmap with a fixed number of examples from each dataset is probably not what I am looking for.
Is it possible to do this in TF? Thanks!
As of 1.12, tf.data.experimental.sample_from_datasets provides this functionality:
https://www.tensorflow.org/api_docs/python/tf/data/experimental/sample_from_datasets
EDIT: Looks like in earlier versions this can be accessed by tf.contrib.data.sample_from_datasets
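Usage is roughly as follows (a minimal sketch; the datasets and weights here are made up, and the weights are interpreted as relative sampling probabilities):
import tensorflow as tf

datasets = [tf.data.Dataset.from_tensors('a').repeat(),
            tf.data.Dataset.from_tensors('b').repeat(),
            tf.data.Dataset.from_tensors('c').repeat()]
weights = [0.5, 0.3, 0.2]  # one probability per dataset

sampled = tf.data.experimental.sample_from_datasets(datasets, weights=weights)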
If p is a Tensor of probabilities (or unnormalized relative probabilities) where p[i] is the probability that dataset i is chosen, you can use tf.multinomial in conjunction with tf.contrib.data.choose_from_datasets:
# create some datasets and their unnormalized probability of being chosen
datasets = [
    tf.data.Dataset.from_tensors(['a']).repeat(),
    tf.data.Dataset.from_tensors(['b']).repeat(),
    tf.data.Dataset.from_tensors(['c']).repeat(),
    tf.data.Dataset.from_tensors(['d']).repeat()]
p = [1., 2., 3., 4.]  # unnormalized

# random choice function
def get_random_choice(p):
    choice = tf.multinomial(tf.log([p]), 1)
    return tf.cast(tf.squeeze(choice), tf.int64)

# assemble the "choosing" dataset
choice_dataset = tf.data.Dataset.from_tensors([0])  # create a dummy dataset
choice_dataset = choice_dataset.map(lambda x: get_random_choice(p))  # populate it with random choices
choice_dataset = choice_dataset.repeat()  # repeat

# obtain your combined dataset, assembled randomly from source datasets
# with the desired selection frequencies.
combined_dataset = tf.contrib.data.choose_from_datasets(datasets, choice_dataset)
Note that the dataset needs to be initialized (you can't use a simple make_one_shot_iterator):
choice_iterator = combined_dataset.make_initializable_iterator()
choice = choice_iterator.get_next()

with tf.Session() as sess:
    sess.run(choice_iterator.initializer)
    print(''.join([sess.run(choice)[0] for _ in range(20)]))
>> ddbcccdcccbbddadcadb
I think you can use tf.contrib.data.rejection_resample to achieve the target distribution.

Extracting parameters from astropy.modeling Gaussian2D

I have managed to use astropy.modeling to model a 2D Gaussian over my image and the parameters it has produced to fit the image seem reasonable. However, I need to run the 2D Gaussian over thousands of images because we are interested in examining the mean x and y of the model and also the x and y standard deviations over our images. The model output looks like this:
m2
<Gaussian2D(amplitude=0.0009846091239480168, x_mean=30.826676737477573, y_mean=31.004045976953222, x_stddev=2.5046722491074536, y_stddev=3.163048479350727, theta=-0.0070295894129793896)>
I can also tell you this:
type(m2)
<class 'astropy.modeling.functional_models.Gaussian2D'>
Name: Gaussian2D
Inputs: (u'x', u'y')
Outputs: (u'z',)
Fittable parameters: ('amplitude', 'x_mean', 'y_mean', 'x_stddev', 'y_stddev', 'theta')
What I need is a method to extract the parameters of the model, namely:
x_mean
y_mean
x_stddev
y_stddev
I am not familiar with this form of output, so I am really stuck on how to extract the parameters.
The models have attributes you can access:
from astropy.modeling import models
g2d = models.Gaussian2D(1,2,3,4,5)
g2d.amplitude.value # 1.0
g2d.x_mean.value # 2.0
g2d.y_mean.value # 3.0
g2d.x_stddev.value # 4.0
g2d.y_stddev.value # 5.0
You need to extract these values after you have fitted the model, but you can access them in the same way: .<name>.value.
You can also extract them all in one go, but then you need to keep track of which parameter is in which position:
g2d.parameters # array([ 1., 2., 3., 4., 5., 0.])
# Amplitude
g2d.parameters[0] # 1.0
# x-mean
g2d.parameters[1] # 2.0
# ...
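Putting this together for the image use case in the question, a fit-then-extract step could look roughly like this (a sketch, assuming image is a 2D numpy array and that the rough initial guesses below are adequate):
import numpy as np
from astropy.modeling import models, fitting

y, x = np.mgrid[:image.shape[0], :image.shape[1]]
p_init = models.Gaussian2D(amplitude=image.max(),
                           x_mean=image.shape[1] / 2,
                           y_mean=image.shape[0] / 2,
                           x_stddev=3, y_stddev=3)
fitter = fitting.LevMarLSQFitter()
m2 = fitter(p_init, x, y, image)   # fitted Gaussian2D model

print(m2.x_mean.value, m2.y_mean.value,
      m2.x_stddev.value, m2.y_stddev.value)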
An alternative approach is to use estimate_line_parameters. The docs for it aren't very clear in this area (or so I think), but if the problem is getting starting parameters for the Gaussians, it is a good place to start.
To approach it this way:
from specutils.spectra import Spectrum2D
from specutils.fitting import estimate_line_parameters
...
e1 = estimate_line_parameters(spectrum, models.Gaussian2D())
a = e1.amplitude.value
b = e1.x_mean.value
c = e1.y_mean.value
d = e1.x_stddev.value
e = e1.y_stddev.value
estimate_line_parameters gives the results to loads of decimal places, so if you are trying to estimate starting values you would probably want to use round(value_name, n), where n is the number of decimal places that you feel is appropriate.
NOTE that a, b, c, etc. are returned as plain values and don't preserve units, so you will also need:
from astropy import units as u
and then (e.g.) a = e1.amplitude.value * u.Jy, where Jy stands in for whatever flux unit (and/or scaled version thereof) your data uses.
Of course, all this assumes that you have subtracted your background sufficiently well...

Rotation/translation of vtk 3D image with interpolation (python)

I have two matrices:
# for example
rotation = matrix([[ 0.61782155,  0.78631834,  0.        ],
                   [ 0.78631834, -0.61782155,  0.        ],
                   [ 0.        ,  0.        , -1.        ]])
translation = matrix([[-0.33657291],
                      [ 1.04497454],
                      [ 0.        ]])
vtkinputpath = "/hello/world/vtkfile.vtk"
vtkoutputpath = "/hello/world/vtkrotatedfile.vtk"
interpolation = "linear"
I have a vtk file which contains a 3D image, and I want to create a Python function that rotates/translates it with interpolation.
import vtk
def rotate(vtkinputpath, vtkoutputpath, rotation, translation, interpolation):
...
I'm trying to take inspiration from the TransformJ plugin sources (see here to understand how it works).
I wanted to use vtk.vtkTransform, but I don't really understand how it works: these examples are not close enough to what I want to do. This is what I did with it:
reader = vtk.vtkXMLImageDataReader()
reader.SetFileName(vtkinputpath)
reader.Update()
transform = reader.vtkTransform()
transform.RotateX(rotation[0])
transform.RotateY(rotation[1])
transform.RotateZ(rotation[2])
transform.Translate(translation[0], translation[1], translation[2])
#and I don't know how I can choose the parameter of the interpolation
But that cannot work...
I saw here that the function RotateWXYZ() exists:
# create a transform that rotates the cone
transform = vtk.vtkTransform()
transform.RotateWXYZ(45,0,1,0)
transformFilter=vtk.vtkTransformPolyDataFilter()
transformFilter.SetTransform(transform)
transformFilter.SetInputConnection(source.GetOutputPort())
transformFilter.Update()
But I don't understand what these lines do.
My main problem is that I cannot find the vtk documentation for Python...
Can you recommend a documentation website for VTK in Python? Or can you at least explain how vtkTransform (RotateWXYZ()) works?
Please, I'm totally lost; nothing works.
I'm not sure there is specific Python documentation, but this can be useful to understand how RotateWXYZ works: http://www.vtk.org/doc/nightly/html/classvtkTransform.html#a9a6bcc6b824fb0a9ee3a9048aa6b262c
To create the transform you want, you can combine the rotation and translation matrices into a single 4x4 matrix: put the rotation matrix in rows and columns 0, 1 and 2, put the translation vector in the rightmost column, and make the bottom row 0, 0, 0, 1 (here's some more info about this). For example:
0.61782155   0.78631834   0   -0.33657291
0.78631834  -0.61782155   0    1.04497454
0            0           -1    0
0            0            0    1
Then you can set the matrix directly on a vtkTransform using SetMatrix:
matrix = [0.61782155, 0.78631834, 0, -0.33657291,
          0.78631834, -0.61782155, 0, 1.04497454,
          0, 0, -1, 0,
          0, 0, 0, 1]
transform.SetMatrix(matrix)
EDIT: Edited to complete the values in the matrix variable.
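Putting this together for a 3D image, one possible sketch (not tied to a specific VTK version; the reader/writer classes below assume the legacy .vtk file holds structured points, and vtkImageReslice is used here to get the requested interpolation):
import vtk

reader = vtk.vtkStructuredPointsReader()   # assumption: legacy .vtk image data
reader.SetFileName(vtkinputpath)
reader.Update()

transform = vtk.vtkTransform()
transform.SetMatrix([0.61782155, 0.78631834, 0, -0.33657291,
                     0.78631834, -0.61782155, 0, 1.04497454,
                     0, 0, -1, 0,
                     0, 0, 0, 1])

reslice = vtk.vtkImageReslice()
reslice.SetInputConnection(reader.GetOutputPort())
reslice.SetResliceTransform(transform)
reslice.SetInterpolationModeToLinear()     # "linear" interpolation as requested
reslice.Update()

writer = vtk.vtkStructuredPointsWriter()
writer.SetFileName(vtkoutputpath)
writer.SetInputConnection(reslice.GetOutputPort())
writer.Write()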
