Pandas vectorize statistical odds-ratio test - python

I'm looking for a faster way to run odds-ratio tests on a large dataset. I have about 1200 variables (see var_col) that I want to test against each other for mutual exclusion/co-occurrence. The odds ratio is defined as (a * d) / (b * c), where a, b, c, d are the numbers of samples with (a) altered in neither site x nor y, (b) altered in site x but not y, (c) altered in y but not x, and (d) altered in both. I'd also like to calculate Fisher's exact test to determine statistical significance. The scipy function fisher_exact can calculate both of these (see below).
#here's a sample of my original dataframe
sample_id_no var_col
0 258.0
1 -24.0
2 -150.0
3 149.0
4 108.0
5 -126.0
6 -83.0
7 2.0
8 -177.0
9 -171.0
10 -7.0
11 -377.0
12 -272.0
13 66.0
14 -13.0
15 -7.0
16 0.0
17 189.0
18 7.0
13 -21.0
19 80.0
20 -14.0
21 -76.0
3 83.0
22 -182.0
import pandas as pd
import numpy as np
from scipy.stats import fisher_exact
import itertools

#create a dataframe with each possible pair of variables
var_pairs = pd.DataFrame(list(itertools.combinations(df.var_col.unique(), 2))).rename(columns={0: 'alpha_site', 1: 'beta_site'})
#create a cross-tab with samples and vars
sample_table = pd.crosstab(df.sample_id_no, df.var_col)
odds_ratio_results = var_pairs.apply(getOddsRatio, axis=1, args=(sample_table,))
#where the function getOddsRatio is defined as:
def getOddsRatio(pairs, sample_table):
    alpha_site, beta_site = pairs
    oddsratio, pvalue = fisher_exact(pd.crosstab(sample_table[alpha_site] > 0,
                                                 sample_table[beta_site] > 0))
    return [oddsratio, pvalue]
This code runs very slowly, especially on large datasets. In my actual dataset I have around 700k variable pairs, and since getOddsRatio() is applied to each pair individually, it is definitely the main source of the slowness. Is there a more efficient solution?
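One direction that may help (a sketch under the assumption that the crosstab fits in memory, not a drop-in replacement): the four contingency counts for every pair can be computed at once with matrix products on a boolean sample-by-variable matrix, so the odds ratios themselves vectorize. Only the Fisher exact p-values would still need a per-pair loop, and that loop could then be restricted to pairs whose odds ratio looks interesting. The names M, n_samples, a, b, c, d below are illustrative.

import numpy as np

# boolean matrix: rows = samples, columns = variables (from the crosstab above)
M = (sample_table > 0).to_numpy().astype(np.int64)
n_samples = M.shape[0]

# pairwise contingency counts for every pair of columns at once
d = M.T @ M                # altered in both x and y
b = M.T @ (1 - M)          # altered in x, not in y
c = (1 - M).T @ M          # altered in y, not in x
a = n_samples - d - b - c  # altered in neither

# odds ratios for all pairs; zero counts in b or c give inf/nan here
with np.errstate(divide='ignore', invalid='ignore'):
    odds_ratios = (a * d) / (b * c)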

Related

How to create a sine curve of positive part only between two integer values

I need to generate the positive part of a sine curve between two values. The idea is that my variable, say monthly-averaged RH, has 12 data points in a year (i.e. a time series) and varies between 50 and 70 in a sinusoidal way. The first and last data points end at 50.
Can anyone help with how I can generate this curve/function, so I can get the values of all the intermediate data points? I am trying to use numpy/scipy for this.
Best,
Debayan
This is basic trig.
import math

for i in range(12):
    print(i, 50 + 20 * math.sin(math.pi * i / 12))
Output:
0 50.0
1 55.17638090205041
2 60.0
3 64.14213562373095
4 67.32050807568876
5 69.31851652578136
6 70.0
7 69.31851652578136
8 67.32050807568878
9 64.14213562373095
10 60.0
11 55.17638090205042
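Since the question mentions numpy, the same curve vectorizes directly (a sketch; the array names are illustrative). Note that with the divisor 12 the final point (i = 11) is about 55.2 rather than 50; if the series must return to 50 at the twelfth point, divide by 11 instead.

import numpy as np

months = np.arange(12)
rh = 50 + 20 * np.sin(np.pi * months / 12)    # peaks at 70 mid-year
# rh = 50 + 20 * np.sin(np.pi * months / 11)  # variant that ends back at 50
print(rh)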

How to normalize the data in a dataframe in the range [0,1]?

I'm trying to implement a paper where PIMA Indians Diabetes dataset is used. This is the dataset after imputing missing values:
Preg Glucose BP SkinThickness Insulin BMI Pedigree Age Outcome
0 1 148.0 72.000000 35.00000 155.548223 33.600000 0.627 50 1
1 1 85.0 66.000000 29.00000 155.548223 26.600000 0.351 31 0
2 1 183.0 64.000000 29.15342 155.548223 23.300000 0.672 32 1
3 1 89.0 66.000000 23.00000 94.000000 28.100000 0.167 21 0
4 0 137.0 40.000000 35.00000 168.000000 43.100000 2.288 33 1
5 1 116.0 74.000000 29.15342 155.548223 25.600000 0.201 30 0
The description:
df.describe()
Preg Glucose BP SkinThickness Insulin BMI Pedigree Age
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 0.855469 121.686763 72.405184 29.153420 155.548223 32.457464 0.471876 33.240885
std 0.351857 30.435949 12.096346 8.790942 85.021108 6.875151 0.331329 11.760232
min 0.000000 44.000000 24.000000 7.000000 14.000000 18.200000 0.078000 21.000000
25% 1.000000 99.750000 64.000000 25.000000 121.500000 27.500000 0.243750 24.000000
50% 1.000000 117.000000 72.202592 29.153420 155.548223 32.400000 0.372500 29.000000
75% 1.000000 140.250000 80.000000 32.000000 155.548223 36.600000 0.626250 41.000000
max 1.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000
The description of normalization from the paper is as follows:
As part of our data preprocessing, the original data values are scaled so as to fall within a small specified range of [0,1] values by performing normalization of the dataset. This will improve speed and reduce runtime complexity. Using the Z-Score we normalize our value set V to obtain a new set of normalized values V’ with the equation below:
V' = (V - Y) / Z
where V’= New normalized value, V=previous value, Y=mean and Z=standard deviation
import scipy.stats
z = scipy.stats.zscore(df)
But when I try to run the code above, I get negative values and values greater than one, i.e., not in the range [0, 1].
There are several points to note here.
Firstly, z-score normalisation will not result in features in the range [0, 1] unless the input data has very specific characteristics.
Secondly, as others have noted, two of the most common ways of normalising data are standardisation and min-max scaling.
Set up data
import string

import pandas as pd

# NB: this CSV has no header row, so read_csv consumes the first record as
# column names here; pass header=None if you want to keep all 768 rows
df = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv')
# For the purposes of this exercise, we'll just use the alphabet as column names
df.columns = list(string.ascii_lowercase)[:len(df.columns)]

print(df.head())
   a    b   c   d    e     f      g   h  i
0  1   85  66  29    0  26.6  0.351  31  0
1  8  183  64   0    0  23.3  0.672  32  1
2  1   89  66  23   94  28.1  0.167  21  0
3  0  137  40  35  168  43.1  2.288  33  1
4  5  116  74   0    0  25.6  0.201  30  0
Standardisation
# Assumed definition (consistent with the output below): this applies the
# paper's formula exactly as written, V - Y/Z, in which the division binds
# before the subtraction
standardised = df - df.mean() / df.std()
# print the minimum and maximum values in the entire dataset with a little formatting
print(f"Min: {standardised.min().min():4.3f} Max: {standardised.max().max():4.3f}")
Min: -4.055 Max: 845.307
As you can see, the values are far from being in [0, 1]. Note that the range of the resulting data from z-score normalisation will vary depending on the distribution of the input data.
Min-max scaling
min_max = (df - df.values.min()) / (df.values.max() - df.values.min())
# print the minimum and maximum values in the entire dataset with a little formatting
print(f"Min: {min_max.min().min():4.3f} Max: {min_max.max().max():4.3f}")
Min: 0.000 Max: 1.000
Here we do indeed get values in [0, 1].
Discussion
These and a number of other scalers exist in the sklearn preprocessing module. I recommend reading the sklearn documentation and using these instead of doing it manually, for various reasons:
There are fewer chances of making a mistake as you have to do less typing.
sklearn will be at least as computationally efficient and often more so.
You should use the same scaling parameters from training on the test data to avoid leakage of test data information. (In most real world uses, this is unlikely to be significant but it is good practice.) By using sklearn you don't need to store the min/max/mean/SD etc. from scaling training data to reuse subsequently on test data. Instead, you can just use scaler.fit_transform(X_train) and scaler.transform(X_test).
If you want to reverse the scaling later on, you can use scaler.inverse_transform(data).
I'm sure there are other reasons, but these are the main ones that come to mind.
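For instance, a minimal sketch with sklearn's MinMaxScaler (X_train and X_test are placeholders for your own train/test split):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                          # scales each column to [0, 1]
X_train_scaled = scaler.fit_transform(X_train)   # fit on training data only
X_test_scaled = scaler.transform(X_test)         # reuse the training min/max
X_restored = scaler.inverse_transform(X_train_scaled)  # undo the scaling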
Your standardization formula isn't meant to put values in the range [0, 1].
If you want to normalize data to make it in such a range, you can use the following formula :
z = (actual_value - min_value_in_database)/(max_value_in_database - min_value_in_database)
You're not obliged to do it manually; just use the sklearn library, where you'll find various standardization and normalization methods in the preprocessing module.
Assuming your original dataframe is df and it has no invalid float values, this should work:
df2 = (df - df.values.min()) / (df.values.max() - df.values.min())
Note that this scales by the global minimum and maximum of the whole frame; to scale each column to [0, 1] independently, use df.min() and df.max() instead of df.values.min() and df.values.max().

Optimized method for mapping contents of a column in a 2D numpy array

I have a numpy 2D array containing integers between 0 to 100. For a particular column, I want to map the values in the following way:
0-4 mapped to 0
5-9 mapped to 5
10-14 mapped to 10, and so on.
This is my code:
import numpy as np

@profile
def map_column(arr, col, incr):
    col_data = arr[:, col]
    vec = np.arange(0, 100, incr)
    for i in range(col_data.shape[0]):
        for j in range(len(vec) - 1):
            if col_data[i] >= vec[j] and col_data[i] < vec[j+1]:
                col_data[i] = vec[j]
        if col_data[i] > vec[-1]:
            col_data[i] = vec[-1]
    return col_data

np.random.seed(1)
myarr = np.random.randint(100, size=(80000, 4))
x = map_column(myarr, 2, 5)
This code takes 8.3 seconds to run. The following is the output of running line_profiler on this code.
Timer unit: 1e-06 s
Total time: 8.32155 s
File: testcode2.py
Function: map_column at line 2
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2 @profile
3 def map_column(arr,col,incr):
4 1 17.0 17.0 0.0 col_data = arr[:,col]
5 1 34.0 34.0 0.0 vec = np.arange(0,100,incr)
6 80001 139232.0 1.7 1.7 for i in range(col_data.shape[0]):
7 1600000 2778636.0 1.7 33.4 for j in range(len(vec)-1):
8 1520000 4965687.0 3.3 59.7 if (col_data[i]>=vec[j] and col_data[i]<vec[j+1]):
9 76062 207492.0 2.7 2.5 col_data[i] = vec[j]
10 80000 221693.0 2.8 2.7 if (col_data[i]>vec[-1]):
11 3156 8761.0 2.8 0.1 col_data[i] = vec[-1]
12 1 2.0 2.0 0.0 return col_data
In the future I will have to work with real data much bigger than this.
Can anyone please suggest a faster method to do this?
I think this can be solved with an arithmetic expression, if I understand the question correctly:
def map_column(arr, col, incr):
    col_data = arr[:, col]
    return (col_data // incr) * incr
should do the trick. Due to the integer division, the remainder is discarded; multiplying again by the increment then gives the next smaller number that is divisible by the increment.
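A quick check against the setup from the question (one behavioural difference worth noting: unlike the original double loop, this version returns a new array rather than writing back into arr in place):

import numpy as np

def map_column(arr, col, incr):
    col_data = arr[:, col]
    return (col_data // incr) * incr   # new array; arr itself is untouched

np.random.seed(1)
myarr = np.random.randint(100, size=(80000, 4))
x = map_column(myarr, 2, 5)
print(x[:10])   # each value snapped down to the nearest multiple of 5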

Appropriate handling of Pandas dataframe scatterplot with varying colors and marker sizes

Given a dataframe df with columns A, B, C, and D,
A B C D
0 88 38 15.66 30.0
1 88 34 15.66 40.0
2 15 15 12.00 20.0
3 15 19 8.00 15.0
4 45 12 6.00 15.0
5 45 30 4.00 30.0
6 29 31 3.60 15.0
7 88 20 3.60 10.0
8 64 25 3.60 15.0
9 45 43 3.60 20.0
I want to make a scatter plot that graphs A vs B, with sizes based on C and colors based on D. After trying many ways to do this, I settled on grouping the data by D, then plotting each group in D:
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

fig, axes = plt.subplots()
factor = df.groupby('D')
for name, group in factor:
    axes.scatter(group.A, group.B, s=(group.C)**2, c=group.D,
                 cmap='viridis', norm=Normalize(vmin=min(df.D), vmax=max(df.D)),
                 label=name)
This yields the appropriate result, but the default legend() function is wrong. The groups listed in the legend have correct names, but incorrect colors and sizes (colors should vary by group, and sizes of all markers should be the same).
I tried to manually set the legend, which I can approximate the colors but can't get the sizes to be equal. Eventually I'd like a second legend that will link sizes to the appropriate levels of C.
axes.legend(loc=1, scatterpoints=1, fontsize='small', frameon=False, ncol=2)
leg = axes.get_legend()
z = plt.cm.viridis(np.linspace(0, 1, len(factor)))
for i in range(len(factor)):
    leg.legendHandles[i].set_color(z[i])
Here's one approach that seems to satisfy your requirements, using Seaborn's lmplot(). (Inspiration taken from this post.)
First, generate some sample data:
import numpy as np
import pandas as pd
n = 10
min_size = 50
max_size = 300
A = np.random.random(n)
B = np.random.random(n)*2
C = np.random.randint(min_size, max_size, size=n)
D = np.random.choice(['Group1','Group2'], n)
df = pd.DataFrame({'A':A,'B':B,'C':C,'D':D})
Now plot:
import seaborn as sns

sns.lmplot(x='A', y='B', hue='D',
           fit_reg=False, data=df,
           scatter_kws={'s': df.C})
UPDATE
Given the updated example data from the OP, the same lmplot() approach should fulfill the specifications: the group legend is tracked by color, and the legend indicators are all the same size.
sns.lmplot(x='A', y='B', hue='D', data=df,
           scatter_kws={'s': df.C**2}, fit_reg=False)
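For the eventual second legend linking marker sizes to values of C, one possible sketch in plain matplotlib (an assumption on my part, not part of the lmplot() approach) uses PathCollection.legend_elements, whose func argument converts the plotted s values (here C squared) back to C:

import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots()
sc = axes.scatter(df.A, df.B, s=df.C**2, c=df.D, cmap='viridis')

# colour legend for the D groups
axes.legend(*sc.legend_elements(prop='colors'), title='D', loc='upper left')

# size legend for C: map plotted sizes (C**2) back to C values
handles, labels = sc.legend_elements(prop='sizes', num=4,
                                     func=lambda s: np.sqrt(s))
fig.legend(handles, labels, title='C', loc='lower right')
plt.show()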

Is numpy.argmax slower than MATLAB [~,idx] = max()?

I am writing a Bayesian classifier for a normal distribution. I have code in both Python and MATLAB that is nearly identical, yet the MATLAB version runs about 50x faster than my Python script. I'm new to Python, so maybe there's something I did terribly wrong. I assume it's somewhere in the loop over the dataset.
Possibly numpy.argmax() is much slower than [~,idx]=max()? Is looping through the data frame slow? Bad use of dictionaries (previously I tried an object and it was even slower)?
Any advice is welcome.
Python code
import numpy as np
import pandas as pd

#import the data as a data frame
train_df = pd.read_table('hw1_traindata.txt', header=None)  #training
train_df.columns = [1, 2]  #rename column titles
The data here has 2 columns (300 rows/samples for training and 300,000 for test). These are the function parameters; mi and Si are the sample means and covariances.
case3_p = {'w': [], 'w0': [], 'W': []}
case3_p['w'] = {1: S1.I*m1, 2: S2.I*m2, 3: S3.I*m3}
case3_p['w0'] = {1: -1.0/2.0*(m1.T*S1.I*m1) - 1.0/2.0*np.log(np.linalg.det(S1)),
                 2: -1.0/2.0*(m2.T*S2.I*m2) - 1.0/2.0*np.log(np.linalg.det(S2)),
                 3: -1.0/2.0*(m3.T*S3.I*m3) - 1.0/2.0*np.log(np.linalg.det(S3))}
case3_p['W'] = {1: -1.0/2.0*S1.I,
                2: -1.0/2.0*S2.I,
                3: -1.0/2.0*S3.I}

#W1=-1.0/2.0*S1.I
#w1_3=S1.I*m1
#w01_3=-1.0/2.0*(m1.T*S1.I*m1)-1.0/2.0*np.log(np.linalg.det(S1))

def g3(x, W, w, w0):
    return x.T*W*x + w.T*x + w0
This is the classifier/loop
train_df['case3'] = 0
for i in range(train_df.shape[0]):
    x = np.mat(train_df.loc[i, [1, 2]]).T  #observation
    #case 3
    vals = [g3(x, case3_p['W'][1], case3_p['w'][1], case3_p['w0'][1]),
            g3(x, case3_p['W'][2], case3_p['w'][2], case3_p['w0'][2]),
            g3(x, case3_p['W'][3], case3_p['w'][3], case3_p['w0'][3])]
    train_df.loc[i, 'case3'] = np.argmax(vals) + 1  #add one to make it the class value
Corresponding MATLAB code
train = load('hw1_traindata.txt');
The discriminant functions
W1 = -1/2*S1^-1; %there isn't one for the other cases
w1_3 = S1^-1*m1'; %fix the transpose thing
w10_3 = -1/2*(m1*S1^-1*m1') - 1/2*log(det(S1));
g1_3 = @(x) x'*W1*x + w1_3'*x + w10_3';
W2 = -1/2*S2^-1;
w2_3 = S2^-1*m2';
w20_3 = -1/2*(m2*S2^-1*m2') - 1/2*log(det(S2));
g2_3 = @(x) x'*W2*x + w2_3'*x + w20_3';
W3 = -1/2*S3^-1;
w3_3 = S3^-1*m3';
w30_3 = -1/2*(m3*S3^-1*m3') - 1/2*log(det(S3));
g3_3 = @(x) x'*W3*x + w3_3'*x + w30_3';
The classifier
case3_class_tr = Inf(size(act_class_tr));
for i = 1:length(train)
    x = train(i,:)'; %current sample
    %case3
    vals = [g1_3(x), g2_3(x), g3_3(x)]; %compute discriminant function values
    [~, case3_class_tr(i)] = max(vals); %get location of max
end
In cases such as this it's best to profile your code. First I created some mock data:
import numpy as np
import pandas as pd
fname = 'hw1_traindata.txt'
ar = np.random.rand(1000, 2)
np.savetxt(fname, ar, delimiter='\t')
m1, m2, m3 = [np.mat(ar).T for ar in np.random.rand(3, 2)]
S1, S2, S3 = [np.mat(ar) for ar in np.random.rand(3, 2, 2)]
Then I wrapped your code in a function and profiled with the lprun (line_profiler) IPython magic. These are the results:
%lprun -f train train(fname, m1, S1, m2, S2, m3, S3)
Timer unit: 5.59946e-07 s
Total time: 4.77361 s
File: <ipython-input-164-563f57dadab3>
Function: train at line 1
Line # Hits Time Per Hit %Time Line Contents
=====================================================
1 def train(fname, m1, S1, m2, S2, m3, S3):
2 1 9868 9868.0 0.1 train_df = pd.read_table(fname ,header = None)#training
3 1 328 328.0 0.0 train_df.columns = [1, 2] #rename column titles
4
5 1 17 17.0 0.0 case3_p = {'w': [], 'w0': [], 'W': []}
6 1 877 877.0 0.0 case3_p['w']={1:S1.I*m1,2:S2.I*m2,3:S3.I*m3}
7 1 356 356.0 0.0 case3_p['w0']={1: -1.0/2.0*(m1.T*S1.I*m1)-
8
9 1 204 204.0 0.0 1.0/2.0*np.log(np.linalg.det(S1)),
10 1 498 498.0 0.0 2: -1.0/2.0*(m2.T*S2.I*m2)-1.0/2.0*np.log(np.linalg.det(S2)),
11 1 502 502.0 0.0 3: -1.0/2.0*(m3.T*S3.I*m3)-1.0/2.0*np.log(np.linalg.det(S3))}
12 1 235 235.0 0.0 case3_p['W']={1: -1.0/2.0*S1.I,
13 1 229 229.0 0.0 2: -1.0/2.0*S2.I,
14 1 230 230.0 0.0 3: -1.0/2.0*S3.I}
15
16 1 1818 1818.0 0.0 train_df['case3'] = 0
17
18 1001 17409 17.4 0.2 for i in range(train_df.shape[0]):
19 1000 4254511 4254.5 49.9 x = np.mat(train_df.loc[i,[1, 2]]).T#observation
20
21 #case 3
22 1000 298245 298.2 3.5 vals = [g3(x,case3_p['W'][1],case3_p['w'][1],case3_p['w0'][1]),
23 1000 269825 269.8 3.2 g3(x,case3_p['W'][2],case3_p['w'][2],case3_p['w0'][2]),
24 1000 274279 274.3 3.2 g3(x,case3_p['W'][3],case3_p['w'][3],case3_p['w0'][3])]
25 1000 3395654 3395.7 39.8 train_df.loc[i,'case3'] = np.argmax(vals) + 1
26
27 1 45 45.0 0.0 return train_df
There are two lines that together take 90% of the time. So let's split these lines up a bit and rerun the profiler:
%lprun -f train train(fname, m1, S1, m2, S2, m3, S3)
Timer unit: 5.59946e-07 s
Total time: 6.15358 s
File: <ipython-input-197-92d9866b57dc>
Function: train at line 1
Line # Hits Time Per Hit %Time Line Contents
======================================================
...
19 1000 5292988 5293.0 48.2 thing = train_df.loc[i,[1, 2]] # Observation
20 1000 265101 265.1 2.4 x = np.mat(thing).T
...
26 1000 143142 143.1 1.3 index = np.argmax(vals) + 1 # Add one to make it the class value
27 1000 4164122 4164.1 37.9 train_df.loc[i,'case3'] = index
Most time is spent indexing the Pandas dataframe! Taking the argmax is only 1.5% of total execution time.
The situation can be improved somewhat by pre-allocating train_df['case3'] and using .iloc:
%lprun -f train train(fname, m1, S1, m2, S2, m3, S3)
Timer unit: 5.59946e-07 s
Total time: 3.26716 s
File: <ipython-input-192-f6173cdf9990>
Function: train at line 1
Line # Hits Time Per Hit %Time Line Contents
======= ======= ======================================
16 1 1548 1548.0 0.0 train_df['case3'] = np.zeros(len(train_df))
...
19 1000 2608489 2608.5 44.7 thing = train_df.iloc[i,[0, 1]] # Observation
20 1000 228959 229.0 3.9 x = np.mat(thing).T
...
26 1000 123165 123.2 2.1 index = np.argmax(vals) + 1 # Add one to make it the class value
27 1000 1849283 1849.3 31.7 train_df.iloc[i,2] = index
Still though, iterating individual values from Pandas dataframes in tight loops is a bad idea. In this case use Pandas only for loading the text-data (it's very good at it) but other than that use "raw" Numpy arrays. E.g. use train_data = pd.read_table(fname, header=None).values. And when you reach the analysis stage maybe go back to Pandas.
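As a rough sketch of that suggestion (reusing g3 and case3_p from the question), the loop over a raw NumPy array might look like this, with a single dataframe write at the end instead of one per row:

import numpy as np
import pandas as pd

data = pd.read_table(fname, header=None).values   # plain NumPy array
preds = np.empty(len(data), dtype=int)
for i in range(len(data)):
    x = np.mat(data[i, :2]).T   # observation as a column vector
    vals = [g3(x, case3_p['W'][k], case3_p['w'][k], case3_p['w0'][k])
            for k in (1, 2, 3)]
    preds[i] = np.argmax(vals) + 1   # +1 to make it the class value
train_df['case3'] = preds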
Some other ramblings:
- Use Python's zero-based indexing and don't go out of your way to use one-based indexing.
- Consider using normal Numpy arrays instead of matrices. When you use matrices you tend to mix them up with arrays and run into hard-to-debug problems.
- MATLAB has a JIT compiler, so a speed difference between Python and MATLAB is expected for loop-heavy code.
It's really hard to tell, but straight out of the box MATLAB will be faster than NumPy, primarily because it comes with its own Math Kernel Library (MKL). Whether 50x is a reasonable figure is hard to say without comparing plain NumPy against MATLAB's MKL directly. There are other Python distributions that come with their own MKL builds, such as Enthought and Anaconda.
On Anaconda's MKL Optimizations page you'll see a chart comparing regular Anaconda with the MKL-enabled version. The improvement is not linear, but it is definitely there.
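If you want to check which BLAS/LAPACK implementation your own NumPy build is linked against (and hence whether MKL is in play), NumPy can report its build configuration:

import numpy as np

np.show_config()   # lists the BLAS/LAPACK libraries NumPy was built against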
