Does the smf.ols() model require data scaling? (Python)

I have a dataframe with multiple x columns and one y column, and I'd like to model the linear relationship between y and the multiple x variables,
so I am using smf.ols() to fit the formula. I am wondering whether I need to scale the data before fitting it with ols().
I checked the statsmodels OLS documentation and it never seems to mention data scaling, for example this page:
https://www.statsmodels.org/devel/example_formulas.html
At the same time, I once took a DataCamp course and it doesn't mention data scaling either; see, for example, the screenshot from the course below. You can see the regression coefficients for the variables are not of the same order of magnitude, e.g. 3655 vs 83.
Here is what I did for my regression. I am wondering whether, for the example below, I need to add scaling like this:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df_crossplot)
df_scaled = scaler.transform(df_crossplot)
and then pass df_scaled into the function below instead? Do I have to do the scaling step at all? My hesitation is: if I scale the data, how do I convert the regressed formula back into a formula based on the original scale? (A sketch of the back-conversion I have in mind follows the function below.) Thanks for your help.
import statsmodels.formula.api as smf

def linear_regression_statsmodel(df_crossplot, crossplot_y, crossplot_x_list):
    # build a formula string such as "y~+x1+x2+..."
    formula_crossplot = crossplot_y + '~'
    for x in crossplot_x_list:
        formula_crossplot = formula_crossplot + '+' + x
    model_crossplot = smf.ols(formula=formula_crossplot, data=df_crossplot).fit()
    # start the fitted column with the intercept, then add each term
    df_crossplot['regressed'] = model_crossplot.params[0]
    regressed_x_string = f'{model_crossplot.params[0]:,.2f}'
    for ix, x in enumerate(crossplot_x_list):
        df_crossplot['regressed'] = df_crossplot['regressed'] + df_crossplot[x] * model_crossplot.params[ix + 1]
        if model_crossplot.params[ix + 1] > 0:
            regressed_x_string = regressed_x_string + f'+{model_crossplot.params[ix + 1]:,.2f}*{x}'
        else:  # no '+' needed, the coefficient already carries its minus sign
            regressed_x_string = regressed_x_string + f'{model_crossplot.params[ix + 1]:,.2f}*{x}'
    return df_crossplot, model_crossplot, regressed_x_string
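For reference, this is the kind of back-conversion I have in mind, as a minimal, self-contained sketch with made-up data (placeholders, not my real df_crossplot): if y = b0 + sum_i b_i*z_i is fitted on standardized predictors z_i = (x_i - mu_i)/sigma_i, then the original-scale slopes should be b_i/sigma_i and the intercept b0 - sum_i b_i*mu_i/sigma_i. I am not sure this is the recommended way, which is part of my question.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.preprocessing import StandardScaler

# Made-up example data (placeholders, not the real crossplot data).
rng = np.random.default_rng(0)
df = pd.DataFrame({'x1': rng.normal(100, 20, 200), 'x2': rng.normal(5, 1, 200)})
df['y'] = 3.0 + 0.5 * df['x1'] - 2.0 * df['x2'] + rng.normal(0, 1, 200)

# Scale only the x columns, then fit OLS on the scaled data.
scaler = StandardScaler().fit(df[['x1', 'x2']])
df_scaled = pd.DataFrame(scaler.transform(df[['x1', 'x2']]), columns=['x1', 'x2'])
df_scaled['y'] = df['y'].values
model_scaled = smf.ols('y ~ x1 + x2', data=df_scaled).fit()

# Convert the coefficients back to the original scale.
b0 = model_scaled.params['Intercept']
b = model_scaled.params[['x1', 'x2']].values
slopes_original = b / scaler.scale_
intercept_original = b0 - np.sum(b * scaler.mean_ / scaler.scale_)
print(slopes_original, intercept_original)  # roughly [0.5, -2.0] and 3.0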


PCA of stock returns

I have the returns of a particular stock and want to find which of these returns can be used to explain the whole set of returns. Hence I am using PCA to get the top 2 components that explain the returns of the stock. I have taken the log returns of the stock.
My code looks like this:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pcadata = stock['lr']
pca.fit(pcadata)
first_pc= pca.components_[0]
second_pc = pca.components_[1]
When I run this, I get this error:
ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
How do I resolve this error?
PCA is a dimension-reduction procedure, so it needs a 2D array of samples x variables. PCA will then look for the combinations of variables that vary the most within these samples. It looks like you are only including one variable, stock['lr'], which is why you receive the error. Perhaps you could give us a little more explanation about your data so that we can work out how you should structure the input.
Reading your comments (I can't reply directly because I need 50 reputation to do that...), I think you might have mistaken the use of PCA. You are looking for representative samples, while PCA gives you 'representative' variables.
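To illustrate the expected input shape, here is a minimal sketch with made-up data (the ticker columns are placeholders, not from the question): rows are dates (samples) and columns are the log returns of several stocks (variables).
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Made-up log returns: 250 days (rows) x 4 stocks (columns).
rng = np.random.default_rng(0)
returns = pd.DataFrame(rng.normal(0, 0.02, size=(250, 4)),
                       columns=['AAA', 'BBB', 'CCC', 'DDD'])

pca = PCA(n_components=2)
pca.fit(returns)                 # 2D input: samples x variables

first_pc = pca.components_[0]    # loading of each stock on the first component
second_pc = pca.components_[1]
print(pca.explained_variance_ratio_)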

Chi-Square test for groups of unequal size

I'd like to apply the chi-square test scipy.stats.chisquare, but the total number of observations differs between my groups.
import pandas as pd
data = {'expected': [20, 13, 18, 21, 21, 29, 45, 37, 35, 32, 53, 38, 25, 21, 50, 62],
        'observed': [19, 10, 15, 14, 15, 25, 25, 20, 26, 38, 50, 36, 30, 28, 59, 49]}
data = pd.DataFrame(data)
print(data.expected.sum())
print(data.observed.sum())
Ignoring this would be incorrect, right?
Does the default behavior of scipy.stats.chisquare take this into account? I checked with pen and paper and it looks like it doesn't. Is there a parameter for this?
from scipy.stats import chisquare
# incorrect since the number of observations is unequal
chisquare(f_obs=data.observed, f_exp=data.expected)
When I adjust for this manually I get a slightly different result.
# adjust actual number of observations
data['obs_prop']=data['observed'].apply(lambda x: x/data['observed'].sum())
data['observed_new']=data['obs_prop']*data['expected'].sum()
# proper way
chisquare(f_obs=data.observed_new, f_exp=data.expected)
Please correct me if I am wrong somewhere. Thanks.
PS: I tagged R as well, for additional statistical expertise.
Basically this was a different statistical problem - Chi-square test of independence of variables in a contingency table.
from scipy.stats import contingency as cont
chi2, p, dof, exp=cont.chi2_contingency(data)
p
I didn't understand the question entirely. However, the way I see it, you can use scipy.stats.chi2_contingency if you want to test the independence of two categorical variables.
Also, scipy.stats.chisquare can be used to compare observed vs expected frequencies; here the number of categories should be the same. Logically, a category would get a 0 frequency if there is an observed frequency but the expected frequency does not exist, and vice versa.
Hope this helps.
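To make the two tests concrete, here is a minimal sketch using the data from the question (for the goodness-of-fit test it rescales the expected counts to the observed total, which is just the mirror image of the adjustment in the question):
import pandas as pd
from scipy.stats import chisquare, chi2_contingency

data = pd.DataFrame({
    'expected': [20, 13, 18, 21, 21, 29, 45, 37, 35, 32, 53, 38, 25, 21, 50, 62],
    'observed': [19, 10, 15, 14, 15, 25, 25, 20, 26, 38, 50, 36, 30, 28, 59, 49],
})

# Goodness-of-fit: rescale expected so both columns have the same total.
expected_rescaled = data['expected'] * data['observed'].sum() / data['expected'].sum()
print(chisquare(f_obs=data['observed'], f_exp=expected_rescaled))

# Independence: treat the two columns as a 16 x 2 contingency table.
chi2, p, dof, exp = chi2_contingency(data)
print(p)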

Converting a pandas Interval into a string (and back again)

I'm relatively new to Python and am trying to get some data prepped to train a RandomForest. For various reasons, we want the data to be discrete, so there are a few continuous variables that need to be discretized. I found qcut in pandas, which seems to do what I want - I can set a number of bins, and it will discretize the variable into that many bins, trying to keep the counts in each bin even.
However, the output of pandas.qcut is a list of Intervals, and the RandomForest classifier in scikit-learn needs a string. I found that I can convert an interval into a string by using .astype(str). Here's a quick example of what I'm doing:
import pandas as pd
from random import sample
vals = sample(range(0,100), 100)
cuts = pd.qcut(vals, q=5)
str_cuts = pd.qcut(vals, q=5).astype(str)
and then str_cuts is one of the variables passed into a random forest.
However, the intent of this system is to train a RandomForest, save it to a file, and then allow someone to load it at a later date and get a classification for a new test instance, that is not available at training time. And because the classifier was trained on discretized data, the new test instance will need to be discretized before it can be used. So what I want to be able to do is read in a new instance, apply the already-established discretization scheme to it, convert it to a string, and run it through the random forest. However, I'm getting hung up on the best way to 'apply the discretization scheme'.
Is there an easy way to handle this? I assume there's no straight-forward way to convert a string back into an Interval. I can get the list of all Interval values from the discretization (ex: cuts.unique()) and apply that at test-time, but that would require saving/loading a discretization dictionary alongside the random forest, which seems clunky, and I worry about running into issues trying to recreate a categorical variable (coming mostly from R, which is extremely particular about the format of categorical variables). Or is there another way around this that I'm not seeing?
Use the labels argument in qcut, and use pandas Categorical.
Either of those can help you create categories instead of intervals for your variable. Then you can use a form of encoding, for example label encoding or ordinal encoding, to convert the categories (the factors, if you're used to R) to numerical values which the forest will be able to use.
Then the process goes:
cutting => categoricals => encoding
and you don't need to do it by hand anymore.
Lastly, some gradient-boosted tree libraries have support for categorical variables, though it's not a silver bullet and will depend on your goal. See catboost and lightgbm.
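For example, a minimal sketch of that flow (assumptions: quantile bins as in the question; labels=False makes qcut return ordinal codes directly, and retbins=True keeps the bin edges so the same scheme can be reapplied to unseen data):
import pandas as pd
from random import sample

vals = sample(range(0, 100), 100)

# Training time: labels=False gives the bin index (0..q-1) instead of an Interval,
# and retbins=True returns the learned bin edges.
codes, bin_edges = pd.qcut(vals, q=5, labels=False, retbins=True)

# Prediction time: reuse the saved edges on new values with pd.cut.
new_vals = [3, 42, 97]
new_codes = pd.cut(new_vals, bins=bin_edges, labels=False, include_lowest=True)
print(new_codes)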
For future searchers, there are benefits to using transformers from scikit-learn instead of pandas. In this case, KBinsDiscretizer is the scikit equivalent of qcut.
It can be used in a pipeline, which will handle applying the previously-learned discretization to unseen data without the need for storing the discretization dictionary separately or round trip string conversion. Here's an example:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
pipeline = make_pipeline(
    KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile'),
    RandomForestClassifier(),
)
X, y = make_classification()
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
If you really need to convert back and forth between pandas IntervalIndex and string, you'll probably need to do some parsing as described in this answer: https://stackoverflow.com/a/65296110/3945991 and either use FunctionTransformer or write your own Transformer for pipeline integration.
While it may not be the cleanest-looking method, converting a string back into an interval is indeed possible:
import pandas as pd
str_intervals = [i.replace("(","").replace("]", "").split(", ") for i in str_cuts]
original_cuts = [pd.Interval(float(i), float(j)) for i, j in str_intervals]

Fitting on a semi-logarithmic scale and transferring it back to normal?

I am working with an IFFT and have a set of real and imaginary values with their respective frequencies (x-axis). The frequencies are not equidistant, so I can't use a discrete IFFT, and I am unable to fit my data correctly because the values are so jumpy at the beginning. So my plan is to "stretch out" my frequency data points on a log10 scale, fit them (with polyfit) and then return - somehow - to the normal scale.
import numpy as p  # assuming "p" is numpy (or pylab), as used below

f = data[0:27, 0]   # x-values (frequencies)
re = data[0:27, 5]  # y-values (real part)
lgf = p.log10(f)
polylog_re = p.poly1d(p.polyfit(lgf, re, 6))
The fit is definitely better (http://imgur.com/btmC3P0), but is it possible to then transform my polynomial back to the normal x-scaling? Right now I'm using those logarithmic fits for my IFFT and taking the log10 of my transformed values for plotting etc., but that probably defies all mathematical logic and results in errors.
Your fit is perfectly valid, but it is not a regular polynomial fit. By using log10(x) you are using a different model function, something like y(x) = sum_i a_i * (log10(x))^i. If this is okay for you, you are done. If you want to do more maths with it, I would suggest using the natural logarithm instead of the one to base 10.
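In practice, evaluating such a fit back on the original frequency axis just means feeding log10 of the frequency into the fitted polynomial; here is a minimal sketch with made-up stand-in data (the real frequencies and values come from the question's file):
import numpy as np

# Stand-in data: 27 non-equidistant frequencies and some real-part values.
f = np.logspace(0, 3, 27)
re = np.sin(np.log10(f)) + 0.1

# Fit a 6th-degree polynomial in log10(f), as in the question.
polylog_re = np.poly1d(np.polyfit(np.log10(f), re, 6))

# Evaluate the fit at any frequency on the original axis: the coefficients
# stay as they are, only the argument is transformed.
f_dense = np.linspace(f.min(), f.max(), 500)
re_fit = polylog_re(np.log10(f_dense))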

Scikit-learn feature selection for regression data

I am trying to apply a univariate feature selection method using the Python module scikit-learn to a regression (i.e. continuous valued response values) dataset in svmlight format.
I am working with scikit-learn version 0.11.
I have tried two approaches - the first of which failed and the second of which worked for my toy dataset but I believe would give meaningless results for a real dataset.
I would like advice regarding an appropriate univariate feature selection approach I could apply to select the top N features for a regression dataset. I would either like (a) to work out how to make the f_regression function work or (b) to hear alternative suggestions.
The two approaches mentioned above:
I tried using sklearn.feature_selection.f_regression(X,Y).
This failed with the following error message:
"TypeError: copy() takes exactly 1 argument (2 given)"
I tried using chi2(X,Y). This "worked" but I suspect this is because the two response values 0.1 and 1.8 in my toy dataset were being treated as class labels? Presumably, this would not yield a meaningful chi-squared statistic for a real dataset for which there would be a large number of possible response values and the number in each cell [with a particular response value and value for the attribute being tested] would be low?
Please find my toy dataset pasted into the end of this message.
The following code snippet should give the results I describe above.
from sklearn.datasets import load_svmlight_file
X_train_data, Y_train_data = load_svmlight_file(svmlight_format_train_file) #i.e. change this to the name of my toy dataset file
from sklearn.feature_selection import SelectKBest
featureSelector = SelectKBest(score_func="one of the two functions I refer to above",k=2) #sorry, I hope this message is clear
featureSelector.fit(X_train_data,Y_train_data)
print [1+zero_based_index for zero_based_index in list(featureSelector.get_support(indices=True))] #This should print the indices of the top 2 features
Thanks in advance.
Richard
Contents of my contrived svmlight file - with additional blank lines inserted for clarity:
1.8 1:1.000000 2:1.000000 4:1.000000 6:1.000000#mA

1.8 1:1.000000 2:1.000000#mB

0.1 5:1.000000#mC

1.8 1:1.000000 2:1.000000#mD

0.1 3:1.000000 4:1.000000#mE

0.1 3:1.000000#mF

1.8 2:1.000000 4:1.000000 5:1.000000 6:1.000000#mG

1.8 2:1.000000#mH
As larsmans noted, chi2 cannot be used for feature selection with regression data.
Upon updating to scikit-learn version 0.13, the following code selected the top two features (according to the f_regression test) for the toy dataset described above.
def f_regression(X, Y):
    import sklearn.feature_selection
    return sklearn.feature_selection.f_regression(X, Y, center=False)  # center=True (the default) would not work here ("ValueError: center=True only allowed for dense data") but should presumably work in general
from sklearn.datasets import load_svmlight_file
X_train_data, Y_train_data = load_svmlight_file(svmlight_format_train_file) #i.e. change this to the name of my toy dataset file
from sklearn.feature_selection import SelectKBest
featureSelector = SelectKBest(score_func=f_regression,k=2)
featureSelector.fit(X_train_data,Y_train_data)
print [1+zero_based_index for zero_based_index in list(featureSelector.get_support(indices=True))]
You could also try to do feature selection with L1/Lasso regularization. The class specifically designed for this is RandomizedLasso, which trains a Lasso regression on multiple subsamples of your data and selects the features that are chosen most frequently by these models. You can also just use Lasso, LassoLars or SGDClassifier to do the same thing without the benefit of resampling, but faster.
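As a rough illustration of that idea, here is a minimal sketch on synthetic data; it uses plain Lasso with SelectFromModel rather than RandomizedLasso, since the latter is no longer available in recent scikit-learn releases:
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic regression data with only a few informative features.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=1.0)

# Features whose Lasso coefficient is non-zero (above the default threshold) are kept.
selector = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)
print(selector.get_support(indices=True))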
