How to use Chi-square test in dataset with negative values

How to use Chi-square test in dataset with negative values - python

I could not fully explain the title. In order to use the Chi-square test in my dataset, I am finding the smallest value and add each cell with that value. (for example, the range of data here is [-8,11] so I added +8 to each cell and the range turned to [0,19]).
for i in range(len(dataValues.index)):
for j in range(len(dataValues.columns)):
dataValues.iat[i, j] += 8
After Preprocessing:
for i in range(len(dataValues.index)):
for x in topFeatures:
finalDataFrame.at[i, x] = dataValues.at[i, x] - 8
But this causes performance problems. Another solution I'm thinking of is to normalize it. I wrote a function that looks like this:
def normalization(df):
from sklearn import preprocessing
x = df.values # returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled, columns=df.columns)
return df
My program has accelerated lots, but this time my accuracy has decreased. The feature selection process I have done with the first method produces 0.85 accuracy results, this time I am producing 0.70 accuracy.
I want to get rid of this primitive method, but I also want accuracy to remain constant. How do I proceed?
Thank you in advance.

Related

StandardScaler.inverse_transform() return the same array as input :/ Is sklearn broken or am I?

Good evening,
I'm currently pursuing a PhD in chemistry and in this framework I'm trying to apply my few knowledge in python and stats to discriminate sample based on their IR spectrum.
After a few of weeks of data acquisition I'm finally able to build my data set and was about to see what PCA can offer (this was the easy part).
I was able to build my script and get the loadings, scores and everything else that I could possibly need or want. However I used the StandardScaler from sklearn.preprocessing to scale down my data so (correct my if i'm wrong) I should get back loadings in this "standard scaled" space.
As my data are actual IR spectra those loadings have a chemical meanings (even thought there are not real spectrum) e.g. if my PC1 loadings have a peak at XX cm-1 i know that samples with high PC1 are likely to contain compounds that absorb at this wavenumber .
So i want to reverse the StandardScaler transformation. I've tried to used StandardScaler.inverse_transform() however it appears to return me the same array that I gave him... which is very frustrating...
I'm trying to do the same thing with my samples spectrum but it gave me the same result again : here is the portion of my script where I tried this :
Wavenumbers = DFF.columns
#in fact this is a little more complicated but that's the spirit
Spectre = DFF.values.tolist()
#btw DFF is my pandas.dataframe containing spectrum with features = wavenumber
SS = StandardScaler(copy=True)
DFF = SS.fit_transform(DFF) #at this point I use SS for preprocessing before PCA
#I'm then trying to inverse SS and get back the 1rst spectrum of the dataset
D = SS.inverse_transform(DFF[0])
#However at this point DFF[0] and D are almost-exactly the same I'm sure because :
plt.plot(Wavenumbers,D)
plt.plot(Wavenumbers,DFF[0]) #the curves are the sames, and :
for i,j in enumerate(D) :
if j==DFF[0][i] : pass
else : print("{}".format(j-DFF[0][i] )) #return nothing bigger than 10e-16
The problem is more than likely syntax or how i used StandardScaler, however i have no one around me to search for help with that . Can anyone tell me what i did wrong ? or give me an hint on how i could get back my loadings in the "actual real IR spectra" space ?
PS: sorry for the wacky English and i hope to be understandable

Good evening,
After putting the problem aside for a few days I finally re-coded the function I needed (as suggested by Robert Dodier).
For reminder, I wanted to have a function that could take my data from a pandas dataframe and mean-centered it in order to do PCA, but also that could reverse the preprocessing for latter uses.
Here is the code I ended up with :
import pandas as pd
import numpy as np
class Scaler:
std =[]
mean = []
def fit(self,DF):
self.std=[]
self.mean=[]
for c in DF.columns:
self.std.append(DF[c].std())
self.mean.append(DF[c].mean())
def transform(self,DF):
X = np.zeros(shape=DF.shape)
for i,c in enumerate(DF.columns):
for j in range(len(DF.index)):
X[j][i] = (DF[c][j] - self.mean[i]) / self.std[i]
return X
def reverse(self,X):
Y = np.zeros(shape=X.shape)
for i in range(len(X[0])):
for j in range(len(X)):
Y[j][i] = X[j][i] * self.std[i] + self.mean[i]
return Y
def fit_transform(self,DF):
self.fit(DF)
X = self.transform(DF)
return X
It's pretty slow and surely very low-tech but it seems to do the job just fine. Hope it will save some time to other python beginners.
I designed it to be as close as I think sklearn.preprocessing.StandardScaler does it.
example :
S = Scaler() #create scaler object
S.fit(DF) #fit the scaler to the dataframe (calculate mean and std for every columns in DF /!\ DF must be a pd.dataframe)
X=S.transform(DF) # return a np.array with mean centered data
Y = S.reverse(X) # reverse the transformation to get back original data
Again sorry for the fast tipped English. And thanks to Robert for taking the time to answer.

Outlier detection with Local Outlier Factor (LOF)

I am working with healthcare insurance claims data and would like to identify fraudulent claims. Have been reading online to try and find a better method. I came across the following code on scikit-learn.org
Does anyone know how to select the outliers? the code plot them in a graph but I would like to select those outliers if possible.
I have tried appending the y_predictions to the x dataframe but that has not worked.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
np.random.seed(42)
# Generate train data
X = 0.3 * np.random.randn(100, 2)
# Generate some abnormal novel observations
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X + 2, X - 2, X_outliers]
# fit the model
clf = LocalOutlierFactor(n_neighbors=20)
y_pred = clf.fit_predict(X)
y_pred_outliers = y_pred[200:]
Below is the code i tried.
X['outliers'] = y_pred

The first 200 data are inliers while the last 20 are outliers. When you did fit_predict on X, you will get either outlier (-1) or inlier(1) in y_pred. So to get the predicted outliers, you need to get those y_pred = -1 and get the corresponding value in X. Below script will give you the outliers in X.
X_pred_outliers = [each[1] for each in list(zip(y_pred, X.tolist())) if each[0] == -1]
I combine y_pred and X into an array and check if y=-1, if yes then collect X values.
However, there are eight errors on the predictions (8 out of 220). These errors are -1 values in y_pred[:200] and 1 in y_pred[201:220]. Please be aware of the errors as well.

What is the purpose of keras utils normalize?

I'd like to normalize my training set before passing it to my NN so instead of doing it manually (subtract mean and divide by std), I tried keras.utils.normalize() and I am amazed about the results I got.
Running this:
r = np.random.rand(3000) * 1000
nr = normalize(r)
print(np.mean(r))
print(np.mean(nr))
print(np.std(r))
print(np.std(nr))
print(np.min(r))
print(np.min(nr))
print(np.max(r))
print(np.max(nr))

Results in that:
495.60440066771866
0.015737914577213984
291.4440194021
0.009254802974329002
0.20755517410064872
6.590913227674956e-06
999.7631481267636
0.03174747238214018
Unfortunately, the docs don't explain what's happening under the hood. Can you please explain what it does and if I should use keras.utils.normalize instead of what I would have done manually?

It is not the kind of normalization you expect. Actually, it uses np.linalg.norm() under the hood to normalize the given data using Lp-norms:
def normalize(x, axis=-1, order=2):
"""Normalizes a Numpy array.
# Arguments
x: Numpy array to normalize.
axis: axis along which to normalize.
order: Normalization order (e.g. 2 for L2 norm).
# Returns
A normalized copy of the array.
"""
l2 = np.atleast_1d(np.linalg.norm(x, order, axis))
l2[l2 == 0] = 1
return x / np.expand_dims(l2, axis)
For example, in the default case, it would normalize the data using L2-normalization (i.e. the sum of squared of elements would be equal to one).
You can either use this function, or if you don't want to do mean and std normalization manually, you can use StandardScaler() from sklearn or even MinMaxScaler().

How do I find the euclidean distances between rows of my test and train set efficiently?

I have test and train sets with the following dimensions with all features (i.e. columns) as integers.
X_train.shape
(990188L, 19L)
X_test.shape
(424367L, 19L)
I want to find out the euclidean distance among all the rows of train set and all the rows of the test set.
I have to also remove the rows from the train set with a distance threshold of 0.005.
I have a following linear code which is too slow but works fine.
for a in range(X_test.shape[0]):
a_test = np_Test[a]
for b in range(X_train.shape[0]):
a_train = np_Train[b]
if(a != b):
dst = distance.euclidean(a_test, a_train)
if(dst <= 0.005):
train.append(b)
where I note down the indexes of the rows that lie within the distance threshold.
Is there any way to parallelize this code?
I tried using from sklearn.metrics.pairwise import euclidean_distances
but as the data set is huge, I am getting a memory error.
I tried to parallelize the code by using euclidean_distances is batches but some how I think the following code is not working fine.
Please help me if there is any way to parallelize the code.
rows = X_train.shape[0]
rem = rows%1000
no = rows/1000
i = 0
while (i <= no*1000) :
dst_mat = euclidean_distances(X_train[i:i+1000, :], X_test)
condition = np.any(dst_mat <= 0.005, axis = 1)
index = np.where(condition == True)
index = np.add(index, i)
print(index)
print(dst_mat)
i+=1000

Use scipy.spatial.cdist. This will calculate the pairwise distance.
Thanks to Warren Weckesser for pointing out this solution.

Pandas/Statsmodel OLS predicting future values

I've been trying to get a prediction for future values in a model I've created. I have tried both OLS in pandas and statsmodels. Here is what I have in statsmodels:
import statsmodels.api as sm
endog = pd.DataFrame(dframe['monthly_data_smoothed8'])
smresults = sm.OLS(dframe['monthly_data_smoothed8'], dframe['date_delta']).fit()
sm_pred = smresults.predict(endog)
sm_pred
The length of the array returned is equal to the number of records in my original dataframe but the values are not the same. When I do the following using pandas I get no values returned.
from pandas.stats.api import ols
res1 = ols(y=dframe['monthly_data_smoothed8'], x=dframe['date_delta'])
res1.predict
(Note that there is no .fit function for OLS in Pandas) Could somebody shed some light on how I might get future predictions from my OLS model in either pandas or statsmodel-I realize I must not be using .predict properly and I've read the multiple other problems people have had but they do not seem to apply to my case.
edit I believe 'endog' as defined is incorrect-I should be passing the values for which I want to predict; therefore I've created a date range of 12 periods past the last recorded value. But still I miss something as I am getting the error:
matrices are not aligned
edit here is a snippet of data, the last column (in red) of numbers is the date delta which is a difference in months from the first date:
month monthly_data monthly_data_smoothed5 monthly_data_smoothed8 monthly_data_smoothed12 monthly_data_smoothed3 date_delta
0 2011-01-31 3.711838e+11 3.711838e+11 3.711838e+11 3.711838e+11 3.711838e+11 0.000000
1 2011-02-28 3.776706e+11 3.750759e+11 3.748327e+11 3.746975e+11 3.755084e+11 0.919937
2 2011-03-31 4.547079e+11 4.127964e+11 4.083554e+11 4.059256e+11 4.207653e+11 1.938438
3 2011-04-30 4.688370e+11 4.360748e+11 4.295531e+11 4.257843e+11 4.464035e+11 2.924085

I think your issue here is that statsmodels doesn't add an intercept by default, so your model doesn't achieve much of a fit. To solve it in your code would be something like this:
dframe = pd.read_clipboard() # your sample data
dframe['intercept'] = 1
X = dframe[['intercept', 'date_delta']]
y = dframe['monthly_data_smoothed8']
smresults = sm.OLS(y, X).fit()
dframe['pred'] = smresults.predict()
Also, for what it's worth, I think the statsmodel formula api is much nicer to work with when dealing with DataFrames, and adds an intercept by default (add a - 1 to remove). See below, it should give the same answer.
import statsmodels.formula.api as smf
smresults = smf.ols('monthly_data_smoothed8 ~ date_delta', dframe).fit()
dframe['pred'] = smresults.predict()
Edit:
To predict future values, just pass new data to .predict() For example, using the first model:
In [165]: smresults.predict(pd.DataFrame({'intercept': 1,
'date_delta': [0.5, 0.75, 1.0]}))
Out[165]: array([ 2.03927604e+11, 2.95182280e+11, 3.86436955e+11])
On the intercept - there's nothing encoded in the number 1 it's just based on the math of OLS (an intercept is perfectly analogous to a regressor that always equals 1), so you can pull the value right off the summary. Looking at the statsmodels docs, an alternative way to add an intercept would be:
X = sm.add_constant(X)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to use Chi-square test in dataset with negative values - python

Related

StandardScaler.inverse_transform() return the same array as input :/ Is sklearn broken or am I?

Outlier detection with Local Outlier Factor (LOF)

What is the purpose of keras utils normalize?

How do I find the euclidean distances between rows of my test and train set efficiently?

Pandas/Statsmodel OLS predicting future values

Categories

Resources