Issues with scikit-learn PCA package - Python

I am attempting to take a .dat file of about 90,000 data lines with two variables (wavelength and intensity) and apply scikit-learn's PCA to it.
Here is a small set of that data:
wavelength intensity
[um] [W/m**2/um/sr]
196.078431372549 1.108370393265022E-003
192.307692307692 1.163428008597600E-003
188.679245283019 1.223639983609668E-003
The code I am using for this single data set is as follows:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(data)
print(pca.components_)
This is the error I get when I try to apply 2 PCA components to one of the data sets:
ValueError: Datatype coercion is not allowed
Any help resolving this would be much appreciated.
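One common cause of dtype errors like this is that the two header lines ("wavelength intensity" and the units line) are read in as strings. A minimal sketch (the file name "spectrum.dat" is an assumption, not from the original post) of loading only the numeric part before fitting PCA:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical file name; skip the two header lines so the array is read
# as floats rather than strings.
data = np.loadtxt("spectrum.dat", skiprows=2)  # shape: (n_rows, 2)

pca = PCA(n_components=2)
pca.fit(data)
print(pca.components_)
print(pca.explained_variance_ratio_)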

Related

What is the correct way to convert a csv file with text to recordIO format?

I need to convert my dataset (which includes text columns) to recordIO format. I have tried the code below, but I am unable to fix the error shown. Do I need to make further changes to my data format?
ValueError: Unsupported dtype object on array
Code:
import io
import sagemaker.amazon.common as smac
X = df[['Subject','Body']].to_numpy()
y = df[['Label']].to_numpy()
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, X, y)
buf.seek(0)
Dataset example-
Label     Subject    Body
label a   Test one   Test Body
label b   Test two   Test second
According to documentation in "Common Data Formats for Training",
your content-type is associated with the algorithms in the following table:
ContentType                        Algorithm
application/x-recordio             Object Detection Algorithm
application/x-recordio-protobuf    Factorization Machines, K-Means, k-NN, Latent Dirichlet Allocation, Linear Learner, NTM, PCA, RCF, Sequence-to-Sequence
Looking at the guide in the documentation (Data conversion), the data should be passed as arrays of numbers, not strings.
This means that an encoder of some kind is needed (e.g. LabelEncoder for the labels, and an encoding/embedding algorithm for the remaining data). Depending on the result you want to achieve, you can choose from a variety of methods such as one-hot encoding, binary encoding, one-of-k encoding, or even more complex word/sentence embedding algorithms.
For example, for a text classification task with RFC/SVM, it is first necessary to encode the text with a more or less expressive embedding algorithm (e.g. fastText).
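A rough sketch of that idea (the choice of TfidfVectorizer is an assumption for illustration; column names follow the question): vectorize the text columns and label-encode the target so everything is numeric before writing the protobuf tensor.

import io
import sagemaker.amazon.common as smac
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Hypothetical encoding choices; any numeric encoding/embedding would do.
text = df['Subject'] + ' ' + df['Body']
X = TfidfVectorizer().fit_transform(text).toarray().astype('float32')
y = LabelEncoder().fit_transform(df['Label']).astype('float32')  # 1-D label vector

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, X, y)
buf.seek(0)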

does smf.ols() model require data scaling?

I have a dataframe with multiple x columns and one y column. I'd like to model the linear relationship between y and the multiple x variables,
so I am using smf.ols() to fit the formula. I am wondering if I need to scale the data before fitting it with ols().
I checked the statsmodels documentation and it never seems to mention data scaling, for example:
https://www.statsmodels.org/devel/example_formulas.html
Meanwhile, I took a course from DataCamp and it doesn't mention data scaling either; see, for example, the screenshot below from the DataCamp course. You can see the regressed coefficients for the variables are not of the same order of magnitude, e.g. 3655 vs 83.
Here is what I did for my regression. I am wondering, for my example below, whether I need to add scaling like this:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
scaler.fit(df_crossplot)
df_scaled=scaler.transform(df_crossplot)
and then pass df_scaled into the function below? Do I have to do this step? My hesitation is that if I scale the data, how do I convert the regressed formula back to a formula on the original scale? Thanks for your help.
import statsmodels.formula.api as smf

def linear_regression_statsmodel(df_crossplot, crossplot_y, crossplot_x_list):
    # build the patsy formula, e.g. "y~+x1+x2"
    formula_crossplot = crossplot_y + '~'
    for x in crossplot_x_list:
        formula_crossplot = formula_crossplot + '+' + x
    model_crossplot = smf.ols(formula=formula_crossplot, data=df_crossplot).fit()
    # start from the intercept, then add coefficient * column for each predictor
    df_crossplot['regressed'] = model_crossplot.params[0]
    regressed_x_string = f'{model_crossplot.params[0]:,.2f}'
    for ix, x in enumerate(crossplot_x_list):
        df_crossplot['regressed'] = df_crossplot['regressed'] + df_crossplot[x]*model_crossplot.params[ix+1]
        if model_crossplot.params[ix+1] > 0:
            regressed_x_string = regressed_x_string + f'+{model_crossplot.params[ix+1]:,.2f}*{x}'
        else:  # no '+' sign needed since the coefficient already carries a negative sign
            regressed_x_string = regressed_x_string + f'{model_crossplot.params[ix+1]:,.2f}*{x}'
    return df_crossplot, model_crossplot, regressed_x_string
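On the back-conversion the question asks about: if each predictor is standardized as (x - mean)/std, a coefficient b fitted on the scaled data corresponds to b/std on the original scale, and the intercept picks up -b*mean/std for each predictor. A rough sketch (the helper name unscale_coefficients is hypothetical; it assumes the scaler was fit on the x columns only, not on y, and that model_scaled is the smf.ols fit on the standardized predictors):

import numpy as np

def unscale_coefficients(model_scaled, scaler, crossplot_x_list):
    # scaler is a fitted StandardScaler over the predictor columns only
    means, stds = scaler.mean_, scaler.scale_
    betas_scaled = np.array([model_scaled.params[x] for x in crossplot_x_list])
    betas_orig = betas_scaled / stds  # slopes on the original scale
    intercept_orig = model_scaled.params['Intercept'] - np.sum(betas_scaled * means / stds)
    return intercept_orig, dict(zip(crossplot_x_list, betas_orig))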

PCA of stock returns

I have the returns of a particular stock and want to find which of these returns can be used to explain the whole set of returns. Hence I am using PCA to get the top 2 components that explain the stock's returns. I have taken the log returns of the stock.
My code looks like this:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pcadata = stock['lr']
pca.fit(pcadata)
first_pc= pca.components_[0]
second_pc = pca.components_[1]
When I run this, I get this error:
ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
How do I resolve this error?
PCA is a dimension-reduction procedure, so you need a 2D array of samples × variables. PCA will then look for the combinations of variables that vary the most within these samples. It looks like you are only including one variable, stock['lr'], which is why you receive the error. Perhaps you could give us a little more explanation about your data so that we can deduce how you should input it.
Reading your comments (I can't reply because I need 50 reputation to do that...), I think you might have mistaken the use of PCA. You are looking for representative samples, while PCA gives you 'representative' variables.
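To illustrate the shape PCA expects, a minimal sketch (the ticker names and random data are made up): rows are dates, columns are log returns of several stocks, giving a 2D samples-by-variables matrix.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# hypothetical data: 250 trading days x 5 tickers
returns = pd.DataFrame(np.random.randn(250, 5),
                       columns=['AAA', 'BBB', 'CCC', 'DDD', 'EEE'])

pca = PCA(n_components=2)
pca.fit(returns)
print(pca.components_)                # loadings of the first two components
print(pca.explained_variance_ratio_)  # share of variance each component explains

# If you really do have only one series, reshape it into a single-feature
# column to avoid the error, although PCA on one variable is not informative:
single = np.asarray([0.01, -0.02, 0.005]).reshape(-1, 1)
PCA(n_components=1).fit(single)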

Pymc3: Optimizing parameters with multiple data?

I've designed a model using Pymc3, and I have some trouble optimizing it with multiple data sets.
The model is a bit similar to the coal-mining disaster (as in the Pymc3 tutorial for those who know it), except there are multiple switchpoints.
The output of the network is a series of real numbers, for instance:
[151,152,150,20,19,18,0,0,0]
import numpy as np
from pymc3 import Model, Normal, Bernoulli, Poisson
from pymc3.math import switch

with Model() as accrochage_model:
    # n_cycles, data_length, n_peaks and data are defined elsewhere
    time = np.linspace(0, n_cycles*data_length, n_cycles*data_length)
    poisson = [Normal('poisson_0', 5, 1), Normal('poisson_1', 10, 1)]
    variance = 3
    t = [Normal('t_0', 0.5, 0.01), Normal('t_1', 0.7, 0.01)]
    taux = [Bernoulli('taux_{}'.format(i), t[i]) for i in range(n_peaks)]
    switchpoint = [Poisson('switchpoint_{}'.format(i), poisson[i])*taux[i] for i in range(n_peaks)]
    peak = [Normal('peak_0', 150, 2), Normal('peak_1', 50, 2), Normal('peak_2', 0, 2)]
    # piecewise-constant latent profile built from the switchpoints and peaks
    z_init = switch(switchpoint[0] >= time % n_cycles, 0, peak[0])
    z_list = [switch(sum(switchpoint[j] for j in range(i)) >= time % n_cycles, 0, peak[i]-peak[i-1])
              for i in range(1, n_peaks)]
    z = sum(z_list[i] for i in range(len(z_list)))
    z += z_init
    m = Normal('m', z, variance, observed=data)
I have multiple realisations of the true distribution and I'd like to take all of them into account when optimizing the parameters of the system.
Right now the "data" that appears in observed=data is just one list of results, such as:
[151,152,150,20,19,18,0,0,0]
What I would like to do is give not just one but several lists of results,
for instance:
data=([151,152,150,20,19,18,0,0,0],[145,152,150,21,17,19,1,0,0],[151,149,153,17,19,18,0,0,1])
I tried using the shape parameter and making data an array of results but none of it seemed to work.
Does anyone have an idea of how it's possible to do the inference so that the network is optimized for an entire dataset and not a single output?
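One pattern worth trying, sketched below under the assumption that the latent mean z (of shape (data_length,)) broadcasts over the rows of a 2-D observed array, which is how PyMC3/Theano usually handles this:

import numpy as np

# Stack the realisations into a (n_runs, data_length) array; the Normal
# likelihood then scores every run against the same latent profile z.
data = np.array([[151, 152, 150, 20, 19, 18, 0, 0, 0],
                 [145, 152, 150, 21, 17, 19, 1, 0, 0],
                 [151, 149, 153, 17, 19, 18, 0, 0, 1]])
# inside the model block, keep: m = Normal('m', z, variance, observed=data)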

Scikit-learn feature selection for regression data

I am trying to apply a univariate feature selection method using the Python module scikit-learn to a regression (i.e. continuous valued response values) dataset in svmlight format.
I am working with scikit-learn version 0.11.
I have tried two approaches - the first of which failed and the second of which worked for my toy dataset but I believe would give meaningless results for a real dataset.
I would like advice regarding an appropriate univariate feature selection approach I could apply to select the top N features for a regression dataset. I would either like (a) to work out how to make the f_regression function work or (b) to hear alternative suggestions.
The two approaches mentioned above:
I tried using sklearn.feature_selection.f_regression(X,Y).
This failed with the following error message:
"TypeError: copy() takes exactly 1 argument (2 given)"
I tried using chi2(X,Y). This "worked" but I suspect this is because the two response values 0.1 and 1.8 in my toy dataset were being treated as class labels? Presumably, this would not yield a meaningful chi-squared statistic for a real dataset for which there would be a large number of possible response values and the number in each cell [with a particular response value and value for the attribute being tested] would be low?
Please find my toy dataset pasted into the end of this message.
The following code snippet should give the results I describe above.
from sklearn.datasets import load_svmlight_file
X_train_data, Y_train_data = load_svmlight_file(svmlight_format_train_file) #i.e. change this to the name of my toy dataset file
from sklearn.feature_selection import SelectKBest
featureSelector = SelectKBest(score_func="one of the two functions I refer to above", k=2)  # sorry, I hope this message is clear
featureSelector.fit(X_train_data,Y_train_data)
print([1+zero_based_index for zero_based_index in list(featureSelector.get_support(indices=True))])  # This should print the indices of the top 2 features
Thanks in advance.
Richard
Contents of my contrived svmlight file - with additional blank lines inserted for clarity:
1.8 1:1.000000 2:1.000000 4:1.000000 6:1.000000#mA

1.8 1:1.000000 2:1.000000#mB

0.1 5:1.000000#mC

1.8 1:1.000000 2:1.000000#mD

0.1 3:1.000000 4:1.000000#mE

0.1 3:1.000000#mF

1.8 2:1.000000 4:1.000000 5:1.000000 6:1.000000#mG

1.8 2:1.000000#mH
As larsmans noted, chi2 cannot be used for feature selection with regression data.
Upon updating to scikit-learn version 0.13, the following code selected the top two features (according to the f_regression test) for the toy dataset described above.
def f_regression(X, Y):
    import sklearn.feature_selection
    # center=True (the default) would not work here ("ValueError: center=True only
    # allowed for dense data") but should presumably work in general
    return sklearn.feature_selection.f_regression(X, Y, center=False)
from sklearn.datasets import load_svmlight_file
X_train_data, Y_train_data = load_svmlight_file(svmlight_format_train_file) #i.e. change this to the name of my toy dataset file
from sklearn.feature_selection import SelectKBest
featureSelector = SelectKBest(score_func=f_regression,k=2)
featureSelector.fit(X_train_data,Y_train_data)
print([1+zero_based_index for zero_based_index in list(featureSelector.get_support(indices=True))])
You could also try doing feature selection with L1/Lasso regularization. The class specifically designed for this is RandomizedLasso, which trains Lasso regression on multiple subsamples of your data and selects the features that are chosen most frequently by these models. You can also just use Lasso, LassoLars or SGDClassifier to do the same thing without the benefit of resampling, but faster.
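A rough sketch of the plain-Lasso route (the file name and alpha value are illustrative, not from the original post; note that RandomizedLasso has since been removed from recent scikit-learn releases):

import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import Lasso

X, y = load_svmlight_file("train.svmlight")  # hypothetical file name
lasso = Lasso(alpha=0.01).fit(X, y)          # L1 penalty zeroes out weak features
ranking = np.argsort(-np.abs(lasso.coef_))   # features ordered by |coefficient|
print(ranking[:2] + 1)                       # top 2 features, 1-based as above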
