I'm trying to write a block of code that will allow me to identify the risk contribution of assets in a portfolio. The covariance matrix is a 6x6 pandas dataframe.
My code is as follows:
import numpy as np
import pandas as pd
weights = np.array([.1,.2,.05,.25,.1,.3])
data = pd.DataFrame(np.random.randn(1000,6),columns = 'a','b','c','d','e','f'])
covariance = data.cov()
portfolio_variance = (weights*covariance*weights.T)[0,0]
sigma = np.sqrt(portfolio_variance)
marginal_risk = covariance*weights.T
risk_contribution = np.multiply(marginal_risk, weights.T)/sigma
print(risk_contribution)
When I try to run the code I get a KeyError, and if I remove the [0,0] from portfolio_variance I get output that doesn't seem to make sense.
Can somebody point me to my error(s)?
Three problems with your code:
First, you need to open the list of column names with a square bracket:
data = pd.DataFrame(np.random.randn(1000,6),columns = ['a','b','c','d','e','f'])
Second, you're using the two-dimensional indexing operator wrong. You can't say [0,0]; you have to say [0][0].
And last, because you named the columns, you have to use them when indexing, so it's actually ['a'][0]:
portfolio_variance = (weights*covariance*weights.T)['a'][0]
Final working code:
import numpy as np
import pandas as pd
weights = np.array([.1,.2,.05,.25,.1,.3])
data = pd.DataFrame(np.random.randn(1000,6),columns = ['a','b','c','d','e','f'])
covariance = data.cov()
portfolio_variance = (weights*covariance*weights.T)['a'][0]
sigma = np.sqrt(portfolio_variance)
marginal_risk = covariance*weights.T
risk_contribution = np.multiply(marginal_risk, weights.T)/sigma
print(risk_contribution)
portfolio_variance = (weights*covariance*weights.T)
should be
portfolio_variance = weights @ covariance @ weights.T
This will give the portfolio variance, which should be a single number.
The same goes for marginal_risk; it should be
marginal_risk = covariance @ weights.T
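Putting the corrections together, a minimal sketch of the full calculation (reusing the random example data from the question; treat it as illustrative rather than a definitive implementation):
import numpy as np
import pandas as pd

weights = np.array([.1, .2, .05, .25, .1, .3])
data = pd.DataFrame(np.random.randn(1000, 6), columns=['a', 'b', 'c', 'd', 'e', 'f'])
covariance = data.cov()

# @ performs matrix multiplication, so this is a single scalar
portfolio_variance = weights @ covariance @ weights.T
sigma = np.sqrt(portfolio_variance)

# length-6 vector of marginal risks, one entry per asset
marginal_risk = covariance @ weights.T
risk_contribution = np.multiply(marginal_risk, weights.T) / sigma
print(risk_contribution)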
Firstly, there are a few threads on this topic, but they involve deprecated pandas-based packages. Suppose I'm trying to predict a variable w with variables x, y and z. I want to run a multiple linear regression to try and predict w. There are quite a few solutions that will produce the coefficients, but I'm not sure how to use them. So, in pseudocode:
import numpy as np
from scipy import stats
w = np.array((1,2,3,4,5,6,7,8,9,10)) # Time series I'm trying to predict
x = np.array((1,3,6,1,4,6,8,9,2,2)) # The three variables to predict w
y = np.array((2,7,6,1,5,6,3,9,5,7))
z = np.array((1,3,4,7,4,8,5,1,8,2))
def model(w, x, y, z):
    # do something!
    return guess  # where guess is some 10-element array formed
                  # using multiple linear regression of x, y, z
guess = model(w,x,y,z)
r = stats.pearsonr(w,guess) # To see how good guess is
Hopefully this makes sense as I'm new to MLR. There is probably a package in scipy that does all this so any help welcome!
You can use the normal equation method.
Let your equation be of the form: ax + by + cz + d = w.
Then
import numpy as np
x = np.asarray([[1,3,6,1,4,6,8,9,2,2],
                [2,7,6,1,5,6,3,9,5,7],
                [1,3,4,7,4,8,5,1,8,2],
                [1,1,1,1,1,1,1,1,1,1]]).T  # the row of ones gives the intercept d
y = np.asarray([1,2,3,4,5,6,7,8,9,10]).T
a, b, c, d = np.linalg.pinv((x.T).dot(x)).dot(x.T.dot(y))
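This is just the normal equation beta = (XᵀX)⁻¹Xᵀy evaluated via the pseudo-inverse. To get the guess and the correlation check the question asked for, a short follow-up sketch (my addition, reusing x, y and the coefficients from above) could be:
from scipy import stats

# each row of x is (x_i, y_i, z_i, 1), so the prediction is x @ [a, b, c, d]
guess = x.dot(np.array([a, b, c, d]))
r = stats.pearsonr(y, guess)  # correlation between prediction and target
print(r)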
I think I've worked it out now. If anyone could confirm that this produces the correct results, that'd be great!
import numpy as np
from scipy import stats
# What I'm trying to predict
y = [-6,-5,-10,-5,-8,-3,-6,-8,-8]
# Array that stores two predictors in columns
x = np.array([[-4.95,-4.55],[-10.96,-1.08],[-6.52,-0.81],[-7.01,-4.46],[-11.54,-5.87],[-4.52,-11.64],[-3.36,-7.45],[-2.36,-7.33],[-7.65,-10.03]])
# Fit linear least squares and get regression coefficients
beta_hat = np.linalg.lstsq(x, y, rcond=None)[0]
print(beta_hat)
# To store my best guess
estimate = np.zeros((9))
for i in range(0, 9):
    # y = x1*b1 + x2*b2
    estimate[i] = beta_hat[0]*x[i, 0] + beta_hat[1]*x[i, 1]
# Correlation between best guess and real values
print(stats.pearsonr(estimate,y))
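One small aside (mine, not part of the original): the explicit loop above is equivalent to a single matrix product, which is shorter and scales better for long series:
estimate = x @ beta_hat  # same values as the elementwise loop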
I have a pair of 1D arrays (of different lengths) like the following:
data1 = [0,0,0,1,1,1,0,1,0,0,1]
data2 = [0,1,1,0,1,0,0,1]
I would like to get the maximum cross-correlation of the 2 series in Python. In MATLAB, the xcorr() function returns it fine.
I have tried the following 2 methods:
numpy.correlate(data1, data2)
signal.fftconvolve(data2, data1[::-1], mode='full')
Both methods give me the same values, but the values I get from Python are different from what comes out of MATLAB. Python gives me integer values > 1, whereas MATLAB gives actual correlation values between 0 and 1.
I have tried normalizing the 2 arrays first ((value - mean)/SD), but the cross-correlation values I get are in the thousands, which doesn't seem correct.
MATLAB will also give you a lag value at which the cross correlation is the greatest. I assume it is easy to do this using indices, but what's the most appropriate way of doing this if my arrays contain tens of thousands of values?
I would like to mimic the xcorr() function that MATLAB has; any thoughts on how I would do that in Python?
numpy.correlate(arr1, arr2, "full")
gave me the same output as
xcorr(arr1, arr2)
gives in MATLAB.
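For the scaling question: MATLAB's xcorr(x, y, 'coeff') additionally normalizes the result so that an autocorrelation peaks at 1. Something along these lines should reproduce that behaviour and also give the lag of the peak (the normalization by the two signal norms is my reading of the 'coeff' option, so treat this as a sketch):
import numpy as np

def xcorr_coeff(a, b):
    # full cross-correlation, scaled so that an autocorrelation peaks at 1
    c = np.correlate(a, b, "full")
    c = c / (np.linalg.norm(a) * np.linalg.norm(b))
    lags = np.arange(-(len(b) - 1), len(a))
    return lags, c

data1 = np.asarray([0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1], dtype=float)
data2 = np.asarray([0, 1, 1, 0, 1, 0, 0, 1], dtype=float)
lags, c = xcorr_coeff(data1, data2)
print(c.max(), lags[np.argmax(c)])  # peak correlation and the lag at which it occurs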
Here is an implementation of MATLAB's xcorr(x, y) and a comparison of the result with an example:
import numpy as np
import matplotlib.pyplot as plt
import scipy.signal as signal

def xcorr(x, y):
    """
    Perform cross-correlation on x and y
    x : 1st signal
    y : 2nd signal

    returns
    lags : lags of correlation
    corr : coefficients of correlation
    """
    corr = signal.correlate(x, y, mode="full")
    lags = signal.correlation_lags(len(x), len(y), mode="full")
    return lags, corr

n = np.arange(15)
x = 0.84**n
y = np.roll(x, 5)

lags, c = xcorr(x, y)
plt.figure()
plt.stem(lags, c)
plt.show()
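To read the delay off numerically rather than from the plot, take the lag at which the correlation peaks; note that the sign depends on the order of the arguments to correlate, so treat the interpretation as a sketch:
shift = lags[np.argmax(c)]
print(shift)  # its magnitude should match the 5-sample shift introduced by np.roll above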
This code will help in finding the delay between two channels in an audio file (sf here is assumed to be the soundfile package):
import numpy as np
import soundfile as sf

xin, fs = sf.read('recording1.wav')   # xin: samples x channels
frame_len = int(fs*5*1e-3)            # 5 ms frame length
dim_x = xin.shape
M = dim_x[0]   # no. of rows (samples)
N = dim_x[1]   # no. of columns (channels)
sample_lim = frame_len*100
tau = [0]
M_lim = 20000  # limit for testing, as processing takes time

for i in range(1, N):
    # cross-correlate channel i against channel 0 over the first M_lim samples
    c = np.correlate(xin[0:M_lim, 0], xin[0:M_lim, i], "full")
    maxlags = M_lim - 1
    c = c[M_lim - 1 - maxlags: M_lim + maxlags]
    Rmax_pos = np.argmax(c)
    pos = Rmax_pos - M_lim + 1   # lag (in samples) of the correlation peak
    tau.append(pos)

print(tau)
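If the offsets are wanted in seconds rather than samples, divide by the sample rate returned by sf.read (a small addition of mine):
tau_seconds = [t / fs for t in tau]
print(tau_seconds)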
This is what I have thus far:
Stats2003 = np.loadtxt('/DataFiles/2003.txt')
Stats2004 = np.loadtxt('/DataFiles/2004.txt')
Stats2005 = np.loadtxt('/DataFiles/2005.txt')
Stats2006 = np.loadtxt('/DataFiles/2006.txt')
Stats2007 = np.loadtxt('/DataFiles/2007.txt')
Stats2008 = np.loadtxt('/DataFiles/2008.txt')
Stats2009 = np.loadtxt('/DataFiles/2009.txt')
Stats2010 = np.loadtxt('/DataFiles/2010.txt')
Stats2011 = np.loadtxt('/DataFiles/2011.txt')
Stats2012 = np.loadtxt('/DataFiles/2012.txt')
Stats = Stats2003, Stats2004, Stats2005, Stats2006, Stats2007, Stats2008, Stats2009, Stats2010, Stats2011, Stats2012
I am trying to calculate the Euclidean distance between each of these arrays and every other array, but am having difficulty doing so.
I can get the output I would like by calculating the distances like:
dist1 = np.linalg.norm(Stats2003-Stats2004)
dist2 = np.linalg.norm(Stats2003-Stats2005)
dist11 = np.linalg.norm(Stats2004-Stats2005)
etc., but I would like to make these calculations with a loop.
I am displaying the calculations in a table using PrettyTable.
Can anyone point me in the right direction? I haven't found any previous solutions that have worked.
Look at scipy.spatial.distance.cdist.
From the documentation:
Computes distance between each pair of the two collections of inputs.
So you could do something like the following:
import numpy as np
from scipy.spatial.distance import cdist
# start year to stop year
years = range(2003, 2013)
# this will yield an n_years x n_features array
features = np.array([np.loadtxt('/DataFiles/%s.txt' % year) for year in years])
# compute the Euclidean distance from each year to every other year
distance_matrix = cdist(features, features, metric='euclidean')
If you know the start year, and you aren't missing data for any years, then it's easy to determine which two years are being compared at coordinate (m,n) in the distance matrix.
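For example, with the years range defined above, a particular pair can be read off directly (my own illustration):
# distance between the 2003 and 2005 feature vectors
d_2003_2005 = distance_matrix[years.index(2003), years.index(2005)]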
To do the loop you will need to keep data out of your variable names. A simple solution would be to use dictionaries instead. The loops are implicit in the dict comprehensions:
import itertools as it
import numpy as np

years = range(2003, 2013)
stats = {y: np.loadtxt('/DataFiles/{}.txt'.format(y)) for y in years}
dists = {(y1, y2): np.linalg.norm(stats[y1] - stats[y2]) for (y1, y2) in it.combinations(years, 2)}
Now access the stats for a particular year, e.g. 2007, with stats[2007], and distances with tuples, e.g. dists[(2007, 2011)].
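Since the question mentions PrettyTable, here is a rough sketch (my addition, assuming the prettytable package) of rendering dists as a year-by-year table:
from prettytable import PrettyTable

table = PrettyTable()
table.field_names = ['year'] + [str(y) for y in years]
for y1 in years:
    row = [str(y1)]
    for y2 in years:
        if y1 == y2:
            row.append('0')
        else:
            key = (y1, y2) if (y1, y2) in dists else (y2, y1)
            row.append('{:.2f}'.format(dists[key]))
    table.add_row(row)
print(table)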
For a series of angle values in the (-pi, pi) range, I make a histogram. Is there an effective way to calculate the mean and modal (most probable) values? Consider the following example:
import numpy as N, cmath
deg = N.pi/180.
d = N.array([-175., 170, 175, 179, -179])*deg
i = N.sum(N.exp(1j*d))
ave = cmath.phase(i)
i /= float(d.size)
stdev = -2. * N.log(N.sqrt(i.real**2 + i.imag**2))
print(ave/deg, stdev/deg)
Now, let's have a histogram:
counts, bins = N.histogram(d, N.linspace(-N.pi, N.pi, 360))
Is it possible to calculate the mean and mode from counts and bins? For non-periodic data, calculation of a mean is straightforward:
ave = sum(counts*bins[:-1])
Calculating the modal value requires more effort. Actually, I'm not sure my code below is correct: first I identify the bins which occur most frequently, and then I calculate their arithmetic mean:
cmax = bins[N.argmax(counts)]
mode = N.mean(N.take(bins, N.nonzero(counts == cmax)[0]))
I have no idea how to calculate the standard deviation from such data, though. One obvious solution to all my problems (at least those described above) is to convert the histogram data back to a data series and then use it in the calculations. That is neither elegant nor efficient, however.
Any hints will be very appreciated.
This is the partial solution I wrote.
import numpy as N, cmath
import scipy.stats as ST
d = [-175, 170.2, 175.57, 179, -179, 170.2, 175.57, 170.2]
deg = N.pi/180.
data = N.array(d)*deg
i = N.sum(N.exp(1j*data))
ave = cmath.phase(i) # correct and exact mean for periodic data
wrong_ave = N.mean(d)
i /= float(data.size)
stdev = -2. * N.log(N.sqrt(i.real**2 + i.imag**2))
wrong_stdev = N.std(d)
bins = N.linspace(-N.pi, N.pi, 360)
counts, bins = N.histogram(data, bins)
# consider it weighted vector addition
nz = N.nonzero(counts)[0]
weight = counts[nz]
i = N.sum(weight * N.exp(1j*bins[nz])/len(nz))
pave = cmath.phase(i) # correct and approximated mean for periodic data
i /= sum(weight)/float(len(nz))
pstdev = -2. * N.log(N.sqrt(i.real**2 + i.imag**2))
print()
print('scipy: %12.3f (mean) %12.3f (stdev)' % (ST.circmean(data)/deg,
                                               ST.circstd(data)/deg))
When run, it gives the following results:
mean: 175.840 85.843 175.360
stdev: 0.472 151.785 0.430
scipy: 175.840 (mean) 3.673 (stdev)
A few comments now: the first column gives the mean/stdev calculated directly from the raw data. As can be seen, the mean agrees well with scipy.stats.circmean (thanks @JoeKington for pointing it out). Unfortunately the stdev differs; I will look at it later. The second column gives completely wrong results (the non-periodic mean/std from numpy obviously does not work here). The third column gives what I wanted to obtain from the histogram data (@JoeKington: my raw data won't fit in my computer's memory; @dmytro: thanks for your input: of course the bin size will influence the result, but in my application I don't have much choice, i.e. I have to reduce the data somehow). As can be seen, the mean (third column) is properly calculated; the stdev needs further attention :)
Have a look at scipy.stats.circmean and scipy.stats.circstd.
Or do you only have the histogram counts, and not the "raw" data? If so, you could fit a Von Mises distribution to your histogram counts and approximate the mean and stddev in that way.
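For reference, a minimal usage sketch with the angles from the question, passing the (-pi, pi) range explicitly:
import numpy as np
from scipy import stats

deg = np.pi/180.
d = np.array([-175., 170, 175, 179, -179])*deg
print(stats.circmean(d, high=np.pi, low=-np.pi)/deg)
print(stats.circstd(d, high=np.pi, low=-np.pi)/deg)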
Here's how to get an approximation.
Since Var(x) = <x^2> - <x>^2, we have:
meanX = N.sum(counts * bins[:-1]) / N.sum(counts)
meanX2 = N.sum(counts * bins[:-1]**2) / N.sum(counts)
std = N.sqrt(meanX2 - meanX**2)
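One small refinement you might consider (my suggestion, not part of the original answer): use the bin centres instead of the left edges to reduce the bias introduced by the binning:
centers = (bins[:-1] + bins[1:]) / 2
meanX = N.sum(counts * centers) / N.sum(counts)
meanX2 = N.sum(counts * centers**2) / N.sum(counts)
std = N.sqrt(meanX2 - meanX**2)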