I have managed to use astropy.modeling to model a 2D Gaussian over my image and the parameters it has produced to fit the image seem reasonable. However, I need to run the 2D Gaussian over thousands of images because we are interested in examining the mean x and y of the model and also the x and y standard deviations over our images. The model output looks like this:
m2
<Gaussian2D(amplitude=0.0009846091239480168, x_mean=30.826676737477573, y_mean=31.004045976953222, x_stddev=2.5046722491074536, y_stddev=3.163048479350727, theta=-0.0070295894129793896)>
I can also tell you this:
type(m2)
<class 'astropy.modeling.functional_models.Gaussian2D'>
Name: Gaussian2D
Inputs: (u'x', u'y')
Outputs: (u'z',)
Fittable parameters: ('amplitude', 'x_mean', 'y_mean', 'x_stddev', 'y_stddev', 'theta')
What I need is a method to extract the parameters of the model, namely:
x_mean
y_mean
x_stddev
y_stddev
I am not familiar with this form of output, so I am really stuck on how to extract the parameters.
The models have attributes you can access:
from astropy.modeling import models
g2d = models.Gaussian2D(1,2,3,4,5)
g2d.amplitude.value # 1.0
g2d.x_mean.value # 2.0
g2d.y_mean.value # 3.0
g2d.x_stddev.value # 4.0
g2d.y_stddev.value # 5.0
You need to extract these values after you have fitted the model, but you can access them in the same way: .<name>.value.
You can also extract them in one go, but then you need to keep track of which parameter is in which position:
g2d.parameters # array([ 1., 2., 3., 4., 5., 0.])
# Amplitude
g2d.parameters[0] # 1.0
# x-mean
g2d.parameters[1] # 2.0
# ...
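For the use case in the question (running this over thousands of images), a minimal sketch of fitting and then pulling out the four parameters could look like this; the fitter choice, pixel grid and initial guesses below are assumptions, not part of the original code:
import numpy as np
from astropy.modeling import models, fitting

def gaussian_params(image):
    # image is assumed to be a 2D numpy array
    y, x = np.mgrid[:image.shape[0], :image.shape[1]]
    m_init = models.Gaussian2D(amplitude=image.max(),
                               x_mean=image.shape[1] / 2,
                               y_mean=image.shape[0] / 2,
                               x_stddev=3, y_stddev=3)
    fitter = fitting.LevMarLSQFitter()
    m_fit = fitter(m_init, x, y, image)
    # dict(zip(m_fit.param_names, m_fit.parameters)) keeps names and values together
    return (m_fit.x_mean.value, m_fit.y_mean.value,
            m_fit.x_stddev.value, m_fit.y_stddev.value)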
An alternative approach is to use estimate_line_parameters from specutils. The docs for it aren't very clear in this area (or so I think), but if the problem is getting starting parameters for the Gaussians for the lines, it is a good place to start.
To approach it this way:
from specutils.spectra import Spectrum2D
from specutils.fitting import estimate_line_parameters
...
e1 = estimate_line_parameters(spectrum, models.Gaussian2D())
a = e1.amplitude.value
b = e1.x_mean.value
c = e1.y_mean.value
d = e1.x_stddev.value
e = e1.y_stddev.value
estimate_line_parameters reports the results to many decimal places, so if you are estimating starting values you will probably want to use round(value_name, n), where n is the number of decimal places that you feel is appropriate.
NOTE that a, b, c etc. are returned as plain values and don't preserve units, so you will also need:
from astropy import units as u
and then (e.g.) a = e1.amplitude.value * u.Jy, or whatever flux unit (and scaling thereof) is appropriate for your data.
Of course all this assumes that you have got your background sufficiently well subtracted...
I would like to explore solutions for performing expanding OLS in pandas (or other libraries that are DataFrame/Series friendly) efficiently.
Assuming the dataset is large, I am NOT interested in any solutions with a for-loop;
I am looking for solutions about expanding rather than rolling. Rolling functions always require a fixed window, while expanding uses a variable window (starting from the beginning);
Please do not suggest pandas.stats.ols.MovingOLS because it is deprecated;
Please do not suggest other deprecated methods such as expanding_mean.
For example, there is a DataFrame df with two columns X and y. To make it simpler, let's just calculate beta.
Currently, I am thinking about something like
import numpy as np
import pandas as pd
import statsmodels.api as sm
def my_OLS_func(df, y_name, X_name):
    y = df[y_name]
    X = df[X_name]
    X = sm.add_constant(X)
    b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)
    return b
df = pd.DataFrame({'X':[1,2.5,3], 'y':[4,5,6.3]})
df['beta'] = df.expanding().apply(my_OLS_func, args = ('y', 'X'))
Expected values of df['beta'] are 0 (or NaN), 0.66666667, and 1.038462.
However, this approach does not seem to work: expanding().apply() passes each column to the function separately as a Series, so I am not sure how one could pass the two Series as arguments together.
Any suggestions would be appreciated.
One option is to use the RecursiveLS (recursive least squares) model from Statsmodels:
import numpy as np
import statsmodels.api as sm

# Simulate some data
rs = np.random.RandomState(seed=12345)
nobs = 100000
beta = [10., -0.2]
sigma2 = 2.5
exog = sm.add_constant(rs.uniform(size=nobs))
eps = rs.normal(scale=sigma2**0.5, size=nobs)
endog = np.dot(exog, beta) + eps
# Construct and fit the recursive least squares model
mod = sm.RecursiveLS(endog, exog)
res = mod.fit()
# This is a 2 x 100,000 numpy array with the regression coefficients
# that would be estimated when using data from the beginning of the
# sample to each point. You should usually ignore the first k=2
# datapoints since they are controlled by a diffuse prior.
res.recursive_coefficients.filtered
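If you want the result back in DataFrame/Series form, as in the question, a possible sketch (reusing the small df with columns 'X' and 'y' from above) is to pass the pandas objects to RecursiveLS directly and take the slope row of the filtered recursive coefficients:
import pandas as pd
df = pd.DataFrame({'X': [1, 2.5, 3], 'y': [4, 5, 6.3]})
mod = sm.RecursiveLS(df['y'], sm.add_constant(df['X']))
res = mod.fit()
# recursive_coefficients.filtered has shape (k, nobs); row 0 holds the
# constant and row 1 the slope on 'X' estimated on the expanding sample
df['beta'] = res.recursive_coefficients.filtered[1]
Keep in mind the caveat above: the first k estimates are governed by the diffuse prior and are usually best ignored.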
I am trying to learn the basics of PCA analysis in Python using scikit-learn (in particular sklearn.decomposition and sklearn.preprocessing). The goal is to import data from images into a matrix X (each row is a sample, each column is a feature), standardize X, use PCA to extract principal components (the 2 most important, the 6 most important, ..., the 6 least important), project X onto these principal components, reverse the previous transformation, and plot the result in order to see the difference with respect to the original image/images.
Now let's say that I do not want to consider the 2, 3, 4... most important principal components, but instead want to consider the N least relevant components, say N = 6.
How should the analysis be done?
I mean I can't simply standardize then call PCA().fit_transform and then revert back with inverse_transform() to plot the results.
At the moment I am doing something like this:
X_std = StandardScaler().fit_transform(X) # standardize original data
pca = PCA()
model = pca.fit(X_std) # create model with all components
Xprime = model.components_[range(dim-6, dim, 1),:] # get last 6 PC
And then I stop, because I know I should call transform() but I do not understand how to do it; I tried several times without being successful.
Is there someone that can tell me if previous steps are correct and point out the direction to follow?
Thank you very much
EDIT: currently I have adapted this solution as suggested by the first answer to my question:
model = PCA().fit(X_std)
model2pc = model
model2pc.components_[range(2, img_count, 1), :] = 0
Xp_2pc = model2pc.transform(X_std)
Xr_2pc = model2pc.inverse_transform(Xp_2pc)
And then I do the same for 6 PCs, 60 PCs, and the last 6 PCs. What I have noticed is that this is very time consuming. I would like a model that directly extracts the principal components I need (without zeroing out the others) and then to call transform() and inverse_transform() with that model.
If you want to ignore all but the last 6 principal components, you can just zero out the ones you don't want to keep.
N = 6
X_std = StandardScaler().fit_transform(X)
pca = PCA()
model = pca.fit(X_std) # create model with all components
model.components_[:-N] = 0
Then, to remove all but the last N components from the data, just do a forward and inverse transform of the data:
Xprime = model.inverse_transform(model.transform(X_std))
Here is an example:
>>> X = np.random.rand(18).reshape(6, 3)
>>> model = PCA().fit(X)
A round-trip transform should give back the original data:
>>> X
array([[0.16594796, 0.02366958, 0.8403745 ],
[0.25219425, 0.22879029, 0.07950927],
[0.69636084, 0.4410933 , 0.97431828],
[0.50121079, 0.44835563, 0.95236146],
[0.6793044 , 0.53847562, 0.27882302],
[0.32886931, 0.0643043 , 0.10597973]])
>>> model.inverse_transform(model.transform(X))
array([[0.16594796, 0.02366958, 0.8403745 ],
[0.25219425, 0.22879029, 0.07950927],
[0.69636084, 0.4410933 , 0.97431828],
[0.50121079, 0.44835563, 0.95236146],
[0.6793044 , 0.53847562, 0.27882302],
[0.32886931, 0.0643043 , 0.10597973]])
Now zero out the first principal component:
>>> model.components_
array([[ 0.22969899, 0.21209762, 0.94986998],
[-0.67830467, -0.66500728, 0.31251894],
[ 0.69795497, -0.71608653, -0.0088847 ]])
>>> model.components_[:-2] = 0
>>> model.components_
array([[ 0. , 0. , 0. ],
[-0.67830467, -0.66500728, 0.31251894],
[ 0.69795497, -0.71608653, -0.0088847 ]])
The round-trip transform now gives a different result since we've removed the first principal component (which contains the greatest amount of variance):
>>> model.inverse_transform(model.transform(X))
array([[ 0.12742811, -0.01189858, 0.68108405],
[ 0.36513945, 0.33308073, 0.54656949],
[ 0.58029482, 0.33392119, 0.49435263],
[ 0.39987803, 0.35478779, 0.53332196],
[ 0.71114004, 0.56787176, 0.41047233],
[ 0.44000711, 0.16692583, 0.56556581]])
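Regarding the edit in the question (getting the reduced reconstruction directly, without zeroing out the other components), one possible sketch is to project onto the selected rows of components_ by hand; this assumes the default whiten=False and reuses model and X_std from above:
N = 6
comps = model.components_[-N:]            # the last N principal axes
Xp = (X_std - model.mean_) @ comps.T      # project onto those axes only
Xr = Xp @ comps + model.mean_             # reconstruct from them
This gives the same reconstruction as the zeroing approach, but without copying the model or transforming through the full set of components.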
Suppose I have N tf.data.Datasets and a list of N probabilities (summing to 1). I would like to create a dataset such that the examples are sampled from the N datasets with the given probabilities.
I would like this to work for arbitrary probabilities, so a simple zip/concat/flatmap with a fixed number of examples from each dataset is probably not what I am looking for.
Is it possible to do this in TF? Thanks!
As of 1.12, tf.data.experimental.sample_from_datasets provides this functionality:
https://www.tensorflow.org/api_docs/python/tf/data/experimental/sample_from_datasets
EDIT: Looks like in earlier versions this can be accessed by tf.contrib.data.sample_from_datasets
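A minimal usage sketch (the two toy datasets and weights here are just placeholders):
datasets = [tf.data.Dataset.from_tensors(['a']).repeat(),
            tf.data.Dataset.from_tensors(['b']).repeat()]
p = [0.3, 0.7]  # sampling probabilities
combined = tf.data.experimental.sample_from_datasets(datasets, weights=p)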
If p is a Tensor of probabilities (or unnormalized relative probabilities) where p[i] is the probability that dataset i is chosen, you can use tf.multinomial in conjunction with tf.contrib.data.choose_from_datasets:
# create some datasets and their unnormalized probability of being chosen
datasets = [
tf.data.Dataset.from_tensors(['a']).repeat(),
tf.data.Dataset.from_tensors(['b']).repeat(),
tf.data.Dataset.from_tensors(['c']).repeat(),
tf.data.Dataset.from_tensors(['d']).repeat()]
p = [1., 2., 3., 4.] # unnormalized
# random choice function
def get_random_choice(p):
    choice = tf.multinomial(tf.log([p]), 1)
    return tf.cast(tf.squeeze(choice), tf.int64)
# assemble the "choosing" dataset
choice_dataset = tf.data.Dataset.from_tensors([0]) # create a dummy dataset
choice_dataset = choice_dataset.map(lambda x: get_random_choice(p)) # populate it with random choices
choice_dataset = choice_dataset.repeat() # repeat
# obtain your combined dataset, assembled randomly from source datasets
# with the desired selection frequencies.
combined_dataset = tf.contrib.data.choose_from_datasets(datasets, choice_dataset)
Note that the dataset needs to be initialized (you can't use a simple make_one_shot_iterator):
choice_iterator = combined_dataset.make_initializable_iterator()
choice = choice_iterator.get_next()
with tf.Session() as sess:
    sess.run(choice_iterator.initializer)
    print(''.join([sess.run(choice)[0] for _ in range(20)]))
>> ddbcccdcccbbddadcadb
I think you can use tf.contrib.data.rejection_resample to achieve a target distribution.
I want to use sklearn.mixture.GaussianMixture to store a Gaussian mixture model so that I can later use it to generate samples or evaluate it at sample points using the score_samples method. Here is an example where the components have the following weights, means and covariances:
import numpy as np
weights = np.array([0.6322941277066596, 0.3677058722933399])
mu = np.array([[0.9148052872961359, 1.9792961751316835],
[-1.0917396392992502, -0.9304220945910037]])
sigma = np.array([[[2.267889129267119, 0.6553245618368836],
[0.6553245618368835, 0.6571014653342457]],
[[0.9516607767206848, -0.7445831474157608],
[-0.7445831474157608, 1.006599716443763]]])
Then I initialised the mixture as follows:
from sklearn import mixture
gmix = mixture.GaussianMixture(n_components=2, covariance_type='full')
gmix.weights_ = weights # mixture weights (n_components,)
gmix.means_ = mu # mixture means (n_components, 2)
gmix.covariances_ = sigma # mixture cov (n_components, 2, 2)
Finally, I tried to generate a sample based on these parameters, which resulted in an error:
x = gmix.sample(1000)
NotFittedError: This GaussianMixture instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
As I understand it, GaussianMixture is intended to fit a mixture of Gaussians to a sample, but is there a way to provide it with the final values and continue from there?
You rock, J.P.Petersen!
After seeing your answer I compared the changes introduced by calling the fit method. It seems the initial instantiation does not create all the attributes of gmix. Specifically, it is missing the following attributes:
covariances_
means_
weights_
converged_
lower_bound_
n_iter_
precisions_
precisions_cholesky_
The first three are introduced when the given inputs are assigned. Among the rest, for my application the only attribute that I need is precisions_cholesky_, which is the Cholesky decomposition of the inverse covariance matrices. As a minimum requirement I added it as follows:
gmix.precisions_cholesky_ = np.linalg.cholesky(np.linalg.inv(sigma)).transpose((0, 2, 1))
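Putting that together with the values from the question (reusing numpy as np, mixture, and the weights/mu/sigma arrays defined above), a minimal sketch with no call to fit would look like the following; whether this is sufficient may depend on your scikit-learn version:
gmix = mixture.GaussianMixture(n_components=2, covariance_type='full')
gmix.weights_ = weights
gmix.means_ = mu
gmix.covariances_ = sigma
gmix.precisions_cholesky_ = np.linalg.cholesky(np.linalg.inv(sigma)).transpose((0, 2, 1))
X, y = gmix.sample(1000)  # sampling now works without ever calling fit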
It seems that it has a check that makes sure that the model has been trained. You could trick it by training the GMM on a very small data set before setting the parameters. Like this:
gmix = mixture.GaussianMixture(n_components=2, covariance_type='full')
gmix.fit(np.random.rand(10, 2)) # Now it thinks it is trained
gmix.weights_ = weights # mixture weights (n_components,)
gmix.means_ = mu # mixture means (n_components, 2)
gmix.covariances_ = sigma # mixture cov (n_components, 2, 2)
x = gmix.sample(1000) # Should work now
To understand what is happening: when you call sample, GaussianMixture first checks that it has been fitted:
self._check_is_fitted()
Which triggers the following check:
def _check_is_fitted(self):
    check_is_fitted(self, ['weights_', 'means_', 'precisions_cholesky_'])
And finally the last function call:
def check_is_fitted(estimator, attributes, msg=None, all_or_any=all):
which only checks that the classifier already has the attributes.
So in short, the only thing missing to make it work (without having to fit it) is to set the precisions_cholesky_ attribute:
gmix.precisions_cholesky_ = 0
should do the trick (can't try it so not 100% sure :P)
However, if you want to play it safe and have a solution that stays consistent if scikit-learn updates its constraints, the solution of #J.P.Petersen is probably the best way to go.
As a slight alternative to #hashmuke's answer, you can use the precision computation that is used inside GaussianMixture directly:
import numpy as np
from scipy.stats import invwishart as IW
from sklearn.mixture import GaussianMixture as GMM
from sklearn.mixture._gaussian_mixture import _compute_precision_cholesky
n_dims = 5
mu1 = np.random.randn(n_dims)
mu2 = np.random.randn(n_dims)
Sigma1 = IW.rvs(n_dims, 0.1 * np.eye(n_dims))
Sigma2 = IW.rvs(n_dims, 0.1 * np.eye(n_dims))
gmm = GMM(n_components=2)
gmm.weights_ = np.array([0.2, 0.8])
gmm.means_ = np.stack([mu1, mu2])
gmm.covariances_ = np.stack([Sigma1, Sigma2])
gmm.precisions_cholesky_ = _compute_precision_cholesky(gmm.covariances_, 'full')
X, y = gmm.sample(1000)
Depending on your covariance type, you should change 'full' accordingly as the input to _compute_precision_cholesky (it will be one of 'full', 'diag', 'tied', 'spherical').
I'm trying to pass two arrays to a fitting function that takes both values.
Data file:
Column 1: Time
Column 2: Temperature
Column 3: Volume
Column 4: Pressure
0.000,0.946,4.668,0.981
0.050,0.946,4.668,0.981
0.100,0.946,4.669,0.981
0.150,0.952,4.588,0.996
0.200,1.025,4.008,1.117
0.250,1.210,3.093,1.361
0.300,1.445,2.299,1.652
0.350,1.650,1.803,1.887
0.400,1.785,1.524,2.038
0.450,1.867,1.340,2.145
0.500,1.943,1.138,2.280
0.550,2.019,0.958,2.411
0.600,2.105,0.750,2.587
0.650,2.217,0.542,2.791
0.700,2.332,0.366,2.978
0.750,2.420,0.242,3.116
0.800,2.444,0.219,3.114
0.850,2.414,0.219,3.080
Here is the code:
import csv
import numpy as np
from scipy.optimize import curve_fit
# Importing the Data
Time_Air1 = []
Vol_Air1 = []
Temp_Air1 = []
Pres_Air1 = []
with open('Good_Air_Run1.csv', 'r') as Air1:
    reader = csv.reader(Air1, delimiter=',')
    for row in reader:
        Time_Air1.append(row[0])
        Temp_Air1.append(row[1])
        Vol_Air1.append(row[2])
        Pres_Air1.append(row[3])
# Arrays are now passable floats
Time_Air1 = np.float32(np.array(Time_Air1))
Vol_Air1 = np.float32(np.array(Vol_Air1))
Temp_Air1 = np.float32(np.array(Temp_Air1))
Pres_Air1 = np.float32(np.array(Pres_Air1))
# fitting Model
def model_Gamma(V, gam, C):
    return -gam*np.log(V) + C
# Air Data Fitting Data
x1 = Vol_Air1
y1 = Pres_Air1
p0_R1 = (1.0 ,1.0)
optR1, pcovR1 = curve_fit(model_Gamma, x1, y1, p0_R1)
gam_R1, C_R1 = optR1
gam_R1p, C_R1p = pcovR1
y1Mair = model_Gamma(x1, gam_R1, C_R1)
This computes the gamma coefficient, but it's not giving me the value I'm expecting, ~1.2; it gives me ~0.72.
I know ~1.2 is the correct value because my friend fit the same data in gnuplot and got that value.
If there is any information needed to actually try this, I'm happy to supply it.
Caveat: the result obtained here for gamma (about 1.7) still deviates from the postulated 1.2. This answer merely highlights the source of possible errors and illustrates what a good fit might look like.
You are trying to fit data where the dependent variable is related to the independent variable by a model that resembles that of adiabatic processes for ideal gases. Here, the pressure and the volume of a gas are related through
pressure * volume**gamma = constant
When you rearrange the left hand side and right hand side, you get:
pressure = constant * volume**-gamma
or in logarithmic form:
log(pressure) = log(constant) - gamma * log(volume)
You could fit the pressure data to the volume data using either of these two forms, but the fit might not be optimal because of measurement errors. One such error could be a fixed offset (e.g. if some solid object is present in a beaker, the volume scale on the beaker will not accurately represent the volume of any liquid you pour into it).
When you account for such errors, the fit often becomes markedly better.
Below, I've shown the fitting of your data using 3 models: the first is your model, the second takes a volume offset into account, and the third is a non-logarithmic variant of the 2nd model (it is basically the 2nd equation, with an optional volume offset). Note that in your code, when you fit what I call model1, you do not pass log(pressure) to the model, which would only make sense if your pressure data were already tabulated on a logarithmic scale.
>>> import numpy as np
>>> from scipy.optimize import curve_fit
>>> data = np.genfromtxt('/tmp/datafile.txt',
... names=('time', 'temp', 'vol', 'press'), delimiter=',', usecols=range(4))
>>> def model1(volume, gamma, c):
... return np.log(c) - gamma*np.log(volume)
...
>>> def model2(volume, gamma, c, volume_offset):
... return np.log(c) - gamma*np.log(volume + volume_offset)
...
>>> def model3(volume, gamma, c, volume_offset):
... return c * (volume + volume_offset)**(-gamma)
...
>>> vol, press = data['vol'], data['press']
>>> guess1, _ = curve_fit(model1, vol, np.log(press))
>>> guess2, _ = curve_fit(model2, vol, np.log(press))
>>> guess3, _ = curve_fit(model3, vol, press)
>>> guess1, guess2, guess3
(array([ 0.38488521, 2.04536926]),
array([ 1.7269364 , 44.57369479, 4.44625865]),
array([ 1.73186133, 45.20087949, 4.46364872]))
>>> rms = lambda x: np.sqrt(np.mean(x**2))
>>> rms( press - np.exp(model1(vol, *guess1)))
0.29464410744456304
>>> rms(press - model3(vol, *guess3))
0.012672077620951249
Notice how guess2 and guess3 are nearly identical.
The last two lines indicate the rms error. You'll notice that it is smaller for the model that takes into account the offset (if you plot them, you'll see the fit is much better than when you use model1*).
As a final remark, have a look at numpy's excellent functions for importing data, such as np.genfromtxt shown here, as they can save you a lot of tedious typing, as demonstrated in this code.
Footnote: * when you plot using model1, don't forget to put everything back to linear scale, like this:
plt.plot(vol, np.exp(model1(vol, *guess1)))
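For completeness, a minimal plotting sketch (assuming matplotlib and reusing vol, press, guess1, guess3, model1 and model3 from above) to compare the two fits could look like this:
import matplotlib.pyplot as plt
plt.plot(vol, press, 'o', label='data')
plt.plot(vol, np.exp(model1(vol, *guess1)), 'x', label='model1 (no offset)')
plt.plot(vol, model3(vol, *guess3), '+', label='model3 (volume offset)')
plt.xlabel('volume')
plt.ylabel('pressure')
plt.legend()
plt.show()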