Python - OLS Regression Estimation using either statsmodels or scikit-learn

I want to run a rolling OLS regression estimation (with a window of, for example, 3) on a dataset found at this link (https://drive.google.com/drive/folders/0B2Iv8dfU4fTUMVFyYTEtWXlzYkk), which has the following format. The third column (Y) in my dataset is the true value - that's what I want to predict (estimate).
time X Y
0.000543 0 10
0.000575 0 10
0.041324 1 10
0.041331 2 10
0.041336 3 10
0.04134 4 10
...
9.987735 55 239
9.987739 56 239
9.987744 57 239
9.987749 58 239
9.987938 59 239
For a simple OLS regression estimation, I have tried the following script.
# /usr/bin/python -tt
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('estimated_pred.csv')
model = pd.stats.ols.MovingOLS(y=df.Y, x=df[['X']],
                               window_type='rolling', window=3, intercept=True)
df['Y_hat'] = model.y_predict
print(df['Y_hat'])
print (model.summary)
df.plot.scatter(x='X', y='Y', s=0.1)
However, either statsmodels or scikit-learn seems to be a better choice for anything beyond a simple regression. I have tried to make the following script work, but it returns IndexError: index out of bounds with a larger subset of the dataset (for example, more than 100 rows).
# /usr/bin/python -tt
import pandas as pd
import numpy as np
import statsmodels.api as sm
df=pd.read_csv('estimated_pred.csv')
df=df.dropna() # to drop nans in case there are any
window = 3
#print(df.index) # to print index
df['a']=None #constant
df['b1']=None #beta1
df['b2']=None #beta2
for i in range(window, len(df)):
    temp = df.iloc[i-window:i, :]
    RollOLS = sm.OLS(temp.loc[:, 'Y'], sm.add_constant(temp.loc[:, ['time', 'X']])).fit()
    df.iloc[i, df.columns.get_loc('a')] = RollOLS.params[0]
    df.iloc[i, df.columns.get_loc('b1')] = RollOLS.params[1]
    df.iloc[i, df.columns.get_loc('b2')] = RollOLS.params[2]
# The following line gives us predicted values in a row, given the PRIOR row's estimated parameters
df['predicted'] = df['a'].shift(1) + df['b1'].shift(1)*df['time'] + df['b2'].shift(1)*df['X']
print(df['predicted'])
#print(df['b2'])
#print(RollOLS.predict(sm.add_constant(predict_x)))
print(temp)
I want to predict the current value of Y from the previous 3 rolling values of X. Finally, I want to include the mean squared error (MSE) for all the predictions (a summary of the regression analysis). For example, if we look at row 5, the value of X is 2 and the value of Y is 10. If the predicted value of Y at that row is 6, the squared error for that row is (10-6)^2. How can we do this using either statsmodels or scikit-learn, given that pd.stats.ols.MovingOLS was removed in pandas version 0.20.0 and I can't find any reference for it?
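For context, here is a rough sketch of one possible approach using statsmodels' RollingOLS (available since statsmodels 0.11); it assumes the column names time, X and Y from the sample above and is only an illustration, not the one definitive way:
import pandas as pd
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

df = pd.read_csv('estimated_pred.csv').dropna()
exog = sm.add_constant(df[['time', 'X']])

# Fit an OLS model on each rolling window of 3 rows
res = RollingOLS(df['Y'], exog, window=3).fit()

# Shift the estimated parameters by one row so each prediction only uses
# the previous window's estimates, then build the fitted values row by row
params = res.params.shift(1)
df['predicted'] = (params * exog).sum(axis=1, min_count=1)

# Mean squared error over all rows that have a prediction (NaNs are skipped)
mse = ((df['Y'] - df['predicted']) ** 2).mean()
print(mse)
The same MSE can be computed from the df['predicted'] column produced by the explicit loop above.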

Related

How can I use SVM classifier to detect outliers in percentage changes?

I have a pandas dataframe that is in the following format:
This contains the % change in stock prices each day for 3 companies: MSFT, F and BAC.
I would like to use a OneClassSVM classifier to detect whether each data point is an outlier or not. I have tried the following code, which I believe detects the rows that contain outliers.
# Import libraries
import numpy as np
from sklearn.svm import OneClassSVM
import matplotlib.pyplot as plt

# Create SVM classifier
svm = OneClassSVM(kernel='rbf', gamma=0.001, nu=0.03)

# Use svm to fit and predict
svm.fit(delta)
pred = svm.predict(delta)

# If a value is an outlier, the prediction will be -1
outliers = np.where(pred == -1)

# Print rows with outliers
print(outliers)
This gives the following output:
I would then like to add a new column to my dataframe that indicates whether each row is an outlier or not. I have tried the following code, but I get an error due to the lists being different lengths, as shown below.
condition = (delta.index.isin(outliers))
assigned_value = "outlier"
df['isoutlier'] = np.select(condition, assigned_value)
Would you be able to let me know how I could add this column, given that the list of rows containing outliers is much shorter?
It's not very clear what delta and df are in your code; I am assuming they are the same data frame.
You can use the result from svm.predict; here we leave it blank ('') if the row is not an outlier:
import numpy as np
import pandas as pd
from sklearn.svm import OneClassSVM

df = pd.DataFrame(np.random.uniform(0, 1, (100, 3)), columns=['A', 'B', 'C'])
svm = OneClassSVM(kernel='rbf', gamma=0.001, nu=0.03)
svm.fit(df)
pred = svm.predict(df)
df['isoutlier'] = np.where(pred == -1, 'outlier', '')
A B C isoutlier
0 0.869475 0.752420 0.388898
1 0.177420 0.694438 0.129073
2 0.011222 0.245425 0.417329
3 0.791647 0.265672 0.401144
4 0.538580 0.252193 0.142094
.. ... ... ... ...
95 0.742192 0.079426 0.676820 outlier
96 0.619767 0.702513 0.734390
97 0.872848 0.251184 0.887500 outlier
98 0.950669 0.444553 0.088101
99 0.209207 0.882629 0.184912
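If you then want only the flagged rows, a simple boolean filter on the new column does it:
# keep only the rows marked as outliers
outlier_rows = df[df['isoutlier'] == 'outlier']
print(outlier_rows)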

Chi-square test P-value from resampling method vs scipy.stats.chi2_contingency

This question refers to the book "O'Reilly Practical Statistics for Data Scientists, 2nd Edition", chapter 3, section Chi-Square Test.
The book provides an example of a Chi-square test case: a website with three different headlines, each shown to 1000 visitors. The result shows the number of clicks for each headline.
The Observed data is the following:
Headline A B C
Click 14 8 12
No-click 986 992 988
The expected values are calculated as follows:
Headline A B C
Click 11.33 11.33 11.33
No-click 988.67 988.67 988.67
The Pearson residual is defined as R = (Observed - Expected) / sqrt(Expected), so the table becomes:
Headline A B C
Click 0.792 -0.990 0.198
No-click -0.085 0.106 -0.021
The Chi-square statistic is the sum of the squared Pearson residuals, chi2 = sum(R^2), which is 1.666.
So far so good.
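For reference, a quick standalone check of the 1.666 figure from the observed and expected tables above (my own snippet, not part of the book's code):
import numpy as np

observed = np.array([[14, 8, 12],       # clicks for A, B, C
                     [986, 992, 988]])  # no-clicks for A, B, C
expected = np.array([[34 / 3] * 3,
                     [1000 - 34 / 3] * 3])

pearson_residuals = (observed - expected) / np.sqrt(expected)
print((pearson_residuals ** 2).sum())  # ~1.666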
Now here comes the resampling part:
1. Assume a box with 34 ones and 2966 zeros.
2. Shuffle, take three samples of 1000, and count how many ones (clicks) are in each.
3. Find the squared differences between the shuffled counts and the expected counts, then sum them.
4. Repeat steps 2 to 3 a few thousand times.
5. The P-value is how often the resampled sum of squared deviations exceeds the observed value.
The resampling python test code is provided by the book as following:
(Can be downloaded from https://github.com/gedeck/practical-statistics-for-data-scientists/tree/master/python/code)
## Practical Statistics for Data Scientists (Python)
## Chapter 3. Statistical Experiments and Significance Testing
# > (c) 2019 Peter C. Bruce, Andrew Bruce, Peter Gedeck
# Import required Python packages.
from pathlib import Path
import random
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats import power
import matplotlib.pylab as plt
DATA = Path('.').resolve().parents[1] / 'data'
# Define paths to data sets. If you don't keep your data in the same directory as the code, adapt the path names.
CLICK_RATE_CSV = DATA / 'click_rates.csv'
...
## Chi-Square Test
### Chi-Square Test: A Resampling Approach
# Table 3-4
click_rate = pd.read_csv(CLICK_RATE_CSV)
clicks = click_rate.pivot(index='Click', columns='Headline', values='Rate')
print(clicks)
# Table 3-5
row_average = clicks.mean(axis=1)
pd.DataFrame({
    'Headline A': row_average,
    'Headline B': row_average,
    'Headline C': row_average,
})
# Resampling approach
box = [1] * 34
box.extend([0] * 2966)
random.shuffle(box)
def chi2(observed, expected):
    pearson_residuals = []
    for row, expect in zip(observed, expected):
        pearson_residuals.append([(observe - expect) ** 2 / expect
                                  for observe in row])
    # return sum of squares
    return np.sum(pearson_residuals)
expected_clicks = 34 / 3
expected_noclicks = 1000 - expected_clicks
expected = [34 / 3, 1000 - 34 / 3]
chi2observed = chi2(clicks.values, expected)
def perm_fun(box):
    sample_clicks = [sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000))]
    sample_noclicks = [1000 - n for n in sample_clicks]
    return chi2([sample_clicks, sample_noclicks], expected)
perm_chi2 = [perm_fun(box) for _ in range(2000)]
resampled_p_value = sum(perm_chi2 > chi2observed) / len(perm_chi2)
print(f'Observed chi2: {chi2observed:.4f}')
print(f'Resampled p-value: {resampled_p_value:.4f}')
chisq, pvalue, df, expected = stats.chi2_contingency(clicks)
print(f'Observed chi2: {chi2observed:.4f}')
print(f'p-value: {pvalue:.4f}')
Now, I ran perm_fun(box) 2,000 times and obtained a resampled P-value of 0.4775.
However, if I ran perm_fun(box) 10,000 times and 100,000 times, I obtained a resampled P-value of 0.84 both times. It seems to me the P-value should be around 0.84.
Why is stats.chi2_contingency showing such a smaller number?
The result I get for running 2000 times is:
Observed chi2: 1.6659
Resampled p-value: 0.8300
Observed chi2: 1.6659
p-value: 0.4348
And if I were to run it 10,000 times, the result is:
Observed chi2: 1.6659
Resampled p-value: 0.8386
Observed chi2: 1.6659
p-value: 0.4348
software version:
pandas.__version__: 0.25.1
numpy.__version__: 1.16.5
scipy.__version__: 1.3.1
statsmodels.__version__: 0.10.1
sys.version_info: 3.7.4
I ran your code trying 2000, 10000, and 100000 loops, and all three times I got close to .47. I did, however, get an error at this line that I had to fix:
resampled_p_value = sum(perm_chi2 > chi2observed) / len(perm_chi2)
Here perm_chi2 is a list and chi2observed is a float, so I wonder how this code ever ran for you (perhaps whatever you did to fix it was the source of the error). In any case, changing it to the intended
resampled_p_value = sum([1*(x > chi2observed) for x in perm_chi2]) / len(perm_chi2)
allowed me to run it and get close to .47.
Make sure that when you change the number of iterations, you do so only by changing the 2000, none of the other numbers.
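An equivalent fix is to convert the list to a NumPy array first, so the comparison becomes element-wise:
import numpy as np
# fraction of resampled statistics that exceed the observed one
resampled_p_value = np.mean(np.array(perm_chi2) > chi2observed)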

Clustering with mixed data type

Currently my data frame consists of both numerical and categorical values (mixed data types). My data frame looks like this:
id age txn_duration Statename amount gender religion
1 27 275 bihar 110 m hindu
2 33 163 maharashtra 50 f muslim
3 53 63 delhi 50 f muslim
4 47 100 up 50 m hindu
5 39 263 punjab 100 m punjabi
6 41 303 delhi 50 m punjabi
There are 20 states (Statename) and 7 religions. I have done get_dummies for both Statename and religion, but got lots of noise. I have also detected outliers. My questions are:
1. How do I find the optimum number of clusters for mixed data types?
2. In this case I am using the k-means algorithm. Can I use k-modes or any other method that would improve my results? I am not getting good results using k-means.
3. How do I interpret my cluster results? I have used
print (cluster_data[clmns].groupby(['clusters']).mean())
Is there any other way I can inspect or plot them? Please provide the code.
My code is -
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import numpy as np
#Importing libraries
import os
import matplotlib.pyplot as plt#visualization
from PIL import Image
%matplotlib inline
import seaborn as sns#visualization
import itertools
import warnings
warnings.filterwarnings("ignore")
import io
from scipy import stats
from sklearn.cluster import KMeans
from kmodes.kprototypes import KPrototypes
cluster_data = pd.read_csv("cluster.csv")
cluster_data = pd.get_dummies(cluster_data, columns=['StateName'])
cluster_data = pd.get_dummies(cluster_data, columns=['gender'])
cluster_data = pd.get_dummies(cluster_data, columns=['religion'])
clmns = ['mobile', 'age', 'txn_duration', 'amount', 'StateName_Bihar',
'StateName_Delhi', 'StateName_Gujarat', 'StateName_Karnataka',
'StateName_Maharashtra', 'StateName_Punjab', 'StateName_Rajasthan',
'StateName_Telangana', 'StateName_Uttar Pradesh',
'StateName_West Bengal', 'gender_female',
'gender_male', 'religion_buddhist',
'religion_christian', 'religion_hindu',
'religion_jain', 'religion_muslim',
'religion_other', 'religion_sikh']
df_tr_std = stats.zscore(cluster_data[clmns])
#Cluster the data
kmeans = KMeans(n_clusters=3, random_state=0).fit(df_tr_std)
labels = kmeans.labels_
#Glue back to original data
cluster_data['clusters'] = labels
clmns.extend(['clusters'])
#Lets analyze the clusters
print (cluster_data[clmns].groupby(['clusters']).mean())
You can run something like this code:
In the resulting plot you can see that having more than 3 clusters (for the dataset it was run on) does not provide a significant decrease in distortion, so the optimum cluster number would be 3 in that case (simple synthetic data). For noisy data the decision might be harder.
Reference: A. Mueller's scipy notes on sklearn
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

distortions = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)

plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()
Edit for ValueError:
For the ValueError: you need only numerical columns, so you can do this:
df_numerics = df.drop(['Statename', 'gender', 'religion'], axis=1)
You can also drop other columns that you don't want included in the clustering analysis.
With df_numerics, try the elbow method to find a good cluster number.
Then, let's say you found that 3 clusters works well; you can run:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)
labels contains the cluster numbers (0, 1, 2 for 3 clusters) for each row in your dataframe. You can also save this as a column in your dataframe:
df['cluster_labels'] = labels
Then to visualize it you can pick 2 columns (more than that is difficult to visualize). Let's say you picked 'txn_duration' and 'amount'; you can plot those columns and add the cluster labels as the color like this:
import matplotlib.pyplot as plt
plt.scatter(df['txn_duration'],df['amount'], c=df['cluster_labels'])
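As for k-modes/k-prototypes (which you already import): a rough sketch of running KPrototypes from the kmodes package directly on the raw, un-dummied mixed columns. This assumes the categorical columns are Statename, gender and religion; adjust the names and the categorical indices to match your actual CSV:
import pandas as pd
from kmodes.kprototypes import KPrototypes

cluster_data = pd.read_csv("cluster.csv")
cols = ['age', 'txn_duration', 'amount', 'Statename', 'gender', 'religion']
X = cluster_data[cols].values

# positions 3, 4, 5 in `cols` are the categorical columns
kproto = KPrototypes(n_clusters=3, init='Huang', random_state=42)
cluster_data['clusters'] = kproto.fit_predict(X, categorical=[3, 4, 5])

# summarize the numeric columns per cluster (scale them first if ranges differ a lot)
print(cluster_data.groupby('clusters')[['age', 'txn_duration', 'amount']].mean())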

Outlier detection with Local Outlier Factor (LOF)

I am working with healthcare insurance claims data and would like to identify fraudulent claims. I have been reading online to try to find a better method and came across the following code on scikit-learn.org.
Does anyone know how to select the outliers? The code plots them in a graph, but I would like to select those outliers if possible.
I have tried appending the y predictions to the X dataframe, but that has not worked.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
np.random.seed(42)
# Generate train data
X = 0.3 * np.random.randn(100, 2)
# Generate some abnormal novel observations
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X + 2, X - 2, X_outliers]
# fit the model
clf = LocalOutlierFactor(n_neighbors=20)
y_pred = clf.fit_predict(X)
y_pred_outliers = y_pred[200:]
Below is the code I tried.
X['outliers'] = y_pred
The first 200 data points are inliers while the last 20 are outliers. When you call fit_predict on X, you get either outlier (-1) or inlier (1) for each row in y_pred. So to get the predicted outliers, you need to find the entries where y_pred == -1 and take the corresponding values in X. The script below will give you the outliers in X.
X_pred_outliers = [each[1] for each in list(zip(y_pred, X.tolist())) if each[0] == -1]
Here I zip y_pred and X together and, wherever y_pred is -1, collect the corresponding X values.
However, there are eight errors in the predictions (8 out of 220): -1 values in y_pred[:200] and 1 values in y_pred[201:220]. Please be aware of these errors as well.
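Since X in this example is a NumPy array, a boolean mask gives the same result more directly:
outlier_indices = np.where(y_pred == -1)[0]  # row positions of the predicted outliers
X_pred_outliers = X[y_pred == -1]            # the outlier rows themselves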

logistic regression ValueError: Found input variables with inconsistent numbers of samples: [699,

I'm new to Python and trying to perform logistic regression using sklearn on a pandas dataframe. This is what I did:
First I label my data frame:
# imports
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.preprocessing import Imputer
from sklearn.linear_model import LogisticRegression
col=['Id','Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion',
'Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin','Normal Nucleoli','Mitoses','Class']
# read data into a DataFrame
data = pd.read_csv("breast_cancer.txt",header=None, prefix="V")
data.columns = col
d = pd.DataFrame(data,columns=col)
Second I filled all missing values with the mean of the corresponding feature:
list_of_means = d.mean()
# filling missing values with mean
for i in range(2, 10):
    for j in range(699):
        if d.iloc[j, i] == "?":
            d.iloc[j, i] = round(list_of_means[i], 0)
d['Type'] = 'benign'
# map Type to 0 if class is 2 and 1 if class is 4
d['Type'] = d.Class.map({2:0, 4:1})
X = d[['Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion',
'Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin','Normal Nucleoli','Mitoses']]
Third I created a new column named Type to map class 2 to type 0 and class 4 to type 1:
y=[['Type']]
# instantiate a logistic regression model, and fit with X and y
model = LogisticRegression()
model = model.fit(X.reshape(X.shape[0], 1), y)
# check the accuracy on the training set
score = model.score(X, y)
#calculate correlation matrix
corMat = DataFrame(data.iloc[:,2:10].corr())
print('correlation matrix')
print(corMat)
print(score)
print(X.head())
However I am getting this error:
logistic regression ValueError: Found input variables with inconsistent numbers of samples:
After some searching I found out that sklearn requires the data shape of (row number, column number), so I changed the fit method to
model = model.fit(X.reshape(X.shape[0], 1), y)
as you can see above, but now I am getting a new error that says
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'reshape'
data set features
# Attribute Domain
-- -----------------------------------------
1. Sample code number id number
2. Clump Thickness 1 - 10
3. Uniformity of Cell Size 1 - 10
4. Uniformity of Cell Shape 1 - 10
5. Marginal Adhesion 1 - 10
6. Single Epithelial Cell Size 1 - 10
7. Bare Nuclei 1 - 10
8. Bland Chromatin 1 - 10
9. Normal Nucleoli 1 - 10
10. Mitoses 1 - 10
11. Class: (2 for benign, 4 for malignant)
PS: I have noticed that a large number of beginner questions are down-voted on Stack Overflow. Kindly take into account the fact that things that seem obvious to an expert user may take a beginner days to figure out. Please use discretion when pressing the down arrow; you'd harm the vibrancy of this discussion community.
My code was correct except that I had a typo when trying to assign y, so I changed
y=[['Type']]
to
y=d[['Type']]
Also note that scikit-learn will throw a warning because I am passing a column vector instead of a 1d array, but you can solve this by changing
model.fit(X, y) to
model.fit(X, y.values.ravel())
DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
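Putting the two changes together, the fitting part would look roughly like this (a sketch reusing d and X as built in the question, and assuming the feature columns are numeric after the missing-value handling):
# corrected target assignment: select the Type column from the DataFrame d
y = d[['Type']]

model = LogisticRegression()
model.fit(X, y.values.ravel())          # ravel() avoids the column-vector warning
print(model.score(X, y.values.ravel())) # training accuracy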
