PCA of stock returns - python

I have the returns of a particular stock and want to find which of these returns can be used to explain the whole set of returns, so I am applying PCA to extract the top 2 components that explain the stock's returns. I have taken the log returns of the stock.
My code looks like this:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pcadata = stock['lr']
pca.fit(pcadata)
first_pc= pca.components_[0]
second_pc = pca.components_[1]
When I run this, I get this error:
ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
How do I resolve this error?

PCA is a dimension-reduction procedure, so it needs a 2D array of samples x variables. PCA then looks for the combinations of variables that vary the most across those samples. You are only passing a single variable, stock['lr'], which is why you receive the error. Perhaps you could explain a little more about your data so we can work out how you should structure your input.
Reading your comments (I can't reply because I need 50 reputation to do that...), I think you may have misunderstood what PCA does: you are looking for representative samples, while PCA gives you 'representative' variables.
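For illustration, here is a minimal sketch of the samples x variables layout PCA expects, assuming you had log returns for several stocks; the DataFrame and column names below are hypothetical:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical frame: one row per day, one column of log returns per stock
returns = pd.DataFrame({
    'AAA_lr': np.random.randn(250),
    'BBB_lr': np.random.randn(250),
    'CCC_lr': np.random.randn(250),
})

pca = PCA(n_components=2)
pca.fit(returns)                      # 2D input: 250 samples x 3 variables
print(pca.components_)                # each row combines the 3 stock variables
print(pca.explained_variance_ratio_)  # share of variance captured per component
If you truly have only one return series, there is nothing for PCA to reduce: reshaping it with reshape(-1, 1) silences the error, but the single component will just be that one variable.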

Does the smf.ols() model require data scaling?

I have a dataframe with multiple x columns and one y column, and I'd like to estimate the linear relationship between y and the x variables.
So I am using smf.ols() to fit the formula, and I am wondering if I need to scale the data before fitting with ols().
I checked the statsmodels documentation and it never seems to mention data scaling, for example on this page:
https://www.statsmodels.org/devel/example_formulas.html
In the meantime, I took a DataCamp course and it doesn't mention data scaling either; for example, in a screenshot from that course you can see that the regressed coefficients for the variables are of very different magnitudes, like 3655 vs 83.
Here is what I did for my regression. For the example below, I am wondering whether we need to add scaling like this:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
scaler.fit(df_crossplot)
df_scaled=scaler.transform(df_crossplot)
Then after that, do I pass df_scaled into the function below? Do I have to do that scaling step at all? My hesitation is that if I scale the data, how do I convert the regressed formula back to a formula in the original scale? Thanks for your help.
import statsmodels.formula.api as smf

def linear_regression_statsmodel(df_crossplot, crossplot_y, crossplot_x_list):
    # Build the patsy formula, e.g. 'y~+x1+x2'
    formula_crossplot = crossplot_y + '~'
    for x in crossplot_x_list:
        formula_crossplot = formula_crossplot + '+' + x
    model_crossplot = smf.ols(formula=formula_crossplot, data=df_crossplot).fit()

    # Rebuild the fitted values and a printable equation from the coefficients
    df_crossplot['regressed'] = model_crossplot.params[0]
    regressed_x_string = f'{model_crossplot.params[0]:,.2f}'
    for ix, x in enumerate(crossplot_x_list):
        df_crossplot['regressed'] = df_crossplot['regressed'] + df_crossplot[x] * model_crossplot.params[ix + 1]
        if model_crossplot.params[ix + 1] > 0:
            regressed_x_string = regressed_x_string + f'+{model_crossplot.params[ix+1]:,.2f}*{x}'
        else:  # no need for a + sign since the coefficient already carries its negative sign
            regressed_x_string = regressed_x_string + f'{model_crossplot.params[ix+1]:,.2f}*{x}'
    return df_crossplot, model_crossplot, regressed_x_string
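For what it's worth, here is a minimal sketch of how the scaling step could be combined with a back-conversion to the original scale, assuming only the x columns are standardized and y is left as is (df, y_col and x_cols are placeholders for your own data). Note that plain OLS does not require scaling for correctness; it only changes the units of the coefficients, so the predictions are identical either way.
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.preprocessing import StandardScaler

def fit_scaled_and_unscale(df, y_col, x_cols):
    # Standardize the predictors only
    scaler = StandardScaler()
    df_scaled = df.copy()
    df_scaled[x_cols] = scaler.fit_transform(df[x_cols])

    formula = y_col + ' ~ ' + ' + '.join(x_cols)
    model = smf.ols(formula=formula, data=df_scaled).fit()

    # If y = b0 + sum(b_i * z_i) with z_i = (x_i - mean_i) / std_i, then on the
    # original scale the slope on x_i is b_i / std_i and the intercept shifts.
    means, stds = scaler.mean_, scaler.scale_
    slopes = {x: model.params[x] / s for x, s in zip(x_cols, stds)}
    intercept = model.params['Intercept'] - sum(
        model.params[x] * m / s for x, m, s in zip(x_cols, means, stds)
    )
    return intercept, slopes, model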

Not able to fix an error while writing code to analyze the "California Housing" data set from the O'Reilly book

While executing this code (from page 69 of "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems" by Aurelien Geron):
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]
I get this error:
ValueError: Expected 2D array, got 1D array instead:
array=['<1H OCEAN' '<1H OCEAN' 'NEAR OCEAN' ... 'INLAND' '<1H OCEAN' 'NEAR BAY'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
How do I fix it?
In this case the error message describes both the issue and a way to solve it: .fit_transform() expects a 2D array, not a 1D one. One way to achieve this is with .reshape(). Since we are passing a single column (one feature), we should use reshape(-1, 1).
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat.reshape(-1,1))
If housing_cat is a pandas series, then you might have to use:
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat.values.reshape(-1,1))
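Alternatively, if housing_cat was taken from a pandas DataFrame, selecting the column with double brackets keeps it two-dimensional, so no reshape is needed. A minimal sketch, assuming the column is named 'ocean_proximity' as in the book:
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat = housing[["ocean_proximity"]]   # double brackets -> (n_samples, 1) DataFrame
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]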

How to convert a Python dictionary to a Numpy array?

The logistic regression from Python's sklearn library has a .fit() function which takes x_train (features) and y_train (labels) as arguments to train the classifier.
It seems that x_train.shape = (number_of_samples, number_of_features)
For x_train I should use the extracted xvector.scp file, which I am reading like so:
b = kaldiio.load_scp('xvector.scp')
And I can print the content like so:
for file_id in b:
    xvector = b[file_id]
    print(xvector)
Right now the b variable behaves like a dictionary: you can look up the x-vector for a given id. I want to use sklearn's LogisticRegression to classify the x-vectors, and in order to use the .fit() method I need to pass an array as an argument.
My question is how can I make an array that contains only the xvector variables?
PS: there are about 1 million file_ids and each xvector has length 512, which is too big for an in-memory array.
It seems you are trying to store the dictionary into a numpy array. If the dictionary is small, you can directly store the values as:
import numpy as np
x = np.array(list(b.values()))
However, this will run into OOM issues if the dictionary is large. In this case, you would need to use np.memmap as explained here: https://ipython-books.github.io/48-processing-large-numpy-arrays-with-memory-mapping/
Essentially, you add rows to the array a chunk at a time and flush them to disk as you go. The array is stored directly on disk, so it avoids OOM issues.
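A minimal sketch of that approach, assuming roughly 1 million x-vectors of length 512 (the output file name is a placeholder, and b is assumed to support len(); if it does not, count the keys first):
import numpy as np
import kaldiio

b = kaldiio.load_scp('xvector.scp')
n, dim = len(b), 512

# File-backed array: rows go straight to disk instead of RAM
x = np.memmap('xvectors.dat', dtype=np.float32, mode='w+', shape=(n, dim))
for i, file_id in enumerate(b):
    x[i] = b[file_id]
    if i % 100000 == 0:
        x.flush()   # push pending writes to disk
x.flush()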

Issues With scikit learn PCA package

I am attempting to take a .dat file of about 90,000 lines of two variables (wavelength and intensity) and apply sklearn's PCA to it.
Here is a small set of that data:
wavelength intensity
[um] [W/m**2/um/sr]
196.078431372549 1.108370393265022E-003
192.307692307692 1.163428008597600E-003
188.679245283019 1.223639983609668E-003
The code I am using for this single data set is as follows:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(data)
print(pca.components_)
The error I get when I try to apply 2 PCA components to one of the data sets is:
ValueError: Datatype coercion is not allowed
Any help resolving this would be much appreciated.
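One possible cause, though the question does not confirm it, is that the two header lines (the column names and the units) make every column non-numeric when the file is read, and PCA needs a purely numeric 2D array. A hedged sketch of loading only the numeric rows, assuming whitespace-separated columns as shown above (the file name 'spectrum.dat' is a placeholder):
import pandas as pd
from sklearn.decomposition import PCA

# Skip the two header lines so both columns are parsed as floats
data = pd.read_csv('spectrum.dat', sep=r'\s+', skiprows=2,
                   names=['wavelength', 'intensity'])

pca = PCA(n_components=2)
pca.fit(data)
print(pca.components_)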

CNTK & python: How to pass input data to the eval func?

With CNTK I have created a network with 2 input neurons and 1 output neuron.
A line in the training file looks like this:
|features 1.567518 2.609619 |labels 1.000000
Then the network was trained with BrainScript. Now I want to use the network for predicting values. For example: the input data is [1.82, 3.57]; what is the output from the net?
I have tried Python with the following code, but I am new to this and the code does not work. So my question is: how do I pass the input data [1.82, 3.57] to the eval function?
On Stack Overflow there are some hints, here and here, but they are too abstract for me.
Thank you.
import cntk as ct
import numpy as np
z = ct.load_model("LR_reg.dnn", ct.device.cpu())
input_data= np.array([1.82, 3.57], dtype=np.float32)
pred = z.eval({ z.arguments[0] : input_data })
print(pred)
Here's the most defensive way of doing it. CNTK can be forgiving if you omit some of this when the network is specified with V2 constructs. Not sure about a network that was created with V1 code.
Basically you need a pair of brackets for each axis. Which axes exist in BrainScript? There's a batch axis, a sequence axis, and then the static axes of your network. You have one-dimensional data, so the following should work:
input_data= np.array([[[1.82, 3.57]]], dtype=np.float32)
This specifies a batch of one sequence, of length one, containing one 1-d vector of two elements. You can also try omitting the outermost brackets and see if you get the same result.
Update, based on more information from the comments below: we should not forget that the V1 code also saved the parts of the network that compute things like loss and accuracy. If we provide only the features, CNTK will complain that the labels have not been provided. There are two ways to deal with this. One possibility is to provide some fake labels so that the network can evaluate these auxiliary operations. Another possibility is to identify the prediction node and use only that. If the prediction was called 'p' in V1, this Python code
p = z.find_by_name('p')
should create a CNTK function that only needs the features in order to compute the prediction.
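A short usage sketch of that idea (the node name 'p' and the model file name are taken from the discussion above and are assumptions about your network):
import cntk as ct
import numpy as np

z = ct.load_model("LR_reg.dnn", ct.device.cpu())

# Look up the prediction node by the name it was given in BrainScript
p = z.find_by_name('p')

# One batch containing one sequence with one 2-element feature vector
input_data = np.array([[[1.82, 3.57]]], dtype=np.float32)
pred = p.eval({p.arguments[0]: input_data})
print(pred)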
