I have previously made normal curves by following the excellent blog post by Robin Kennedy:
https://public.tableau.com/en-us/s/blog/2013/11/fitting-normal-curve-histogram
But when it comes to TabPy, I am failing to do so. Since running Python code inside Tableau requires some tweaking, some basic Python constructs do not work in Tableau as-is.
Even when I follow the blog step by step and adapt the code the way TabPy requires, the final formula for the curve -
(1 / ([St Dev] * SQRT(2*PI()))) * EXP(-((ATTR([Profit Bin]) - [Mean])^2 / (2*[St Dev]^2))) * [Profit Bin Size] * TOTAL(SUM([Number of Records]))
is proving difficult for me to recreate.
What I have written so far is -
SCRIPT_REAL("
import numpy as np
import matplotlib.mlab as mlab
import math
sigma = maths.sqrt(_arg1)
x = np.linspace(_arg2 - 3*_arg3, _arg2 + 3*_arg3, 100)
", FLOAT([Variance]), FLOAT([Mean]), FLOAT([Std Dev]) )
I am not sure how to proceed from here, i.e., how to actually plot the curve. Any advice on this?
I have made the histogram using Tableau bins but I need to make the curve using Python.
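My best guess at a TabPy version of the blog's density formula is something like the following (untested; it assumes TabPy passes each argument as a list with one entry per row of the partition, and it leaves the [Profit Bin Size] * TOTAL(SUM([Number of Records])) scaling to be applied outside the script, as in the original formula):
SCRIPT_REAL("
import math
mean = _arg2[0]
sd = _arg3[0]
# normal density evaluated at each bin value
return [(1.0 / (sd * math.sqrt(2 * math.pi)))
        * math.exp(-((x - mean) ** 2) / (2 * sd ** 2))
        for x in _arg1]
", ATTR([Profit Bin]), [Mean], [Std Dev])
Is something along these lines the right approach?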
I am trying to write a script in Python to detect the presence of a simple alarm sound in any given input audio file. I explain my solution below, and I would appreciate it if anyone could confirm it is a good one. Any other solution implementable in Python is also welcome.
The way I do this is to compute the cross-correlation of the two signals: I take the FFT of both signals (one of them reversed), multiply them together, and then take the IFFT of the result. The peak of the result is then compared against a pre-specified threshold to decide whether the alarm sound is present.
This is my code:
import scipy.fftpack as fftpack

def similarity(template, test):
    # FFT both signals (template reversed), multiply the spectra,
    # then invert to get the cross-correlation
    corr = fftpack.irfft(fftpack.rfft(test, 2 * test.size) *
                         fftpack.rfft(template[::-1], 2 * template.size))
    return max(abs(corr))
template and test are the 1-D lists of signal data. The second argument to rfft is used to zero-pad the signal before calculating the FFT. However, I am not sure how many zeros should be added. Also, should I do any normalisation of the given signals before applying the FFT? For example, normalising them based on the peak of the template signal?
Solved!
I just needed to use scipy.signal.fftconvolve which takes care of zero padding itself. No normalization was required. So the working code for me is:
from scipy.signal import fftconvolve

def similarity(template, test):
    corr = fftconvolve(template, test, mode='same')
    return max(abs(corr))
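For completeness, this is roughly how I call it (the data here are stand-ins and the threshold is illustrative; it has to be tuned on real recordings):
import numpy as np

template = np.random.randn(1000)    # stand-in for the alarm template samples
test = np.random.randn(48000)       # stand-in for the input recording samples
THRESHOLD = 50.0                    # hypothetical value, tuned empirically

if similarity(template, test) > THRESHOLD:
    print("alarm detected")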
I'm fairly new to programming, but this problem happens in Python and in Excel as well.
I'm using the following formulas for the RC transfer functions:
s/(s+1) for High Pass
1/(s+1) for Low Pass
with s = jwRC
Below is the code I used in Python:
from pylab import *
from numpy import *
from cmath import *
"""
Generating a transfer function for RC filters.
Importing modules for complex math and plotting.
"""
f = arange(1, 5000, 1)
w = 2.0j*pi*f
R=100
C=1E-5
hp_tf = (w*R*C)/(w*R*C+1) # High Pass Transfer function
lp_tf = 1/(w*R*C+1) # Low Pass Transfer function
plot(f, hp_tf) # plot high pass transfer function
plot(f, lp_tf, '-r') # plot low pass transfer function
xscale('log')
I can't post images yet, so I can't show the plot. But the issue here is that the cutoff frequency is different for each filter. The curves should cross at y = 0.707, but they actually cross at about 0.5.
I figure my formula or method is wrong somewhere, but I can't find the mistake. Can anyone help me out?
Also, on a related note, I tried to convert to dB scale and I get the following error:
TypeError: only length-1 arrays can be converted to Python scalars
I'm using the following:
debl=20*log(hp_tf)
This is a classic example of why you should avoid pylab and, more generally, imports of the form
from module import *
unless you know exactly what they do, since they hopelessly clutter the namespace.
Using
import matplotlib.pyplot as plt
import numpy as np
and then calling np.log and plt.plot etc. will solve your problem.
Further explanation
What's happening here is that
from pylab import *
defines a log function from numpy that operates on arrays (the one you want).
However, the later import,
from cmath import *
overwrites it with a version that only accepts scalars, hence your error.
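Applied to your example, a minimal sketch with explicit imports would be (note that a dB conversion conventionally uses 20*log10 of the magnitude, not the natural log of the raw complex values):
import numpy as np
import matplotlib.pyplot as plt

f = np.arange(1, 5000)
w = 2j * np.pi * f
R, C = 100, 1e-5
hp_tf = (w * R * C) / (w * R * C + 1)   # high-pass transfer function
lp_tf = 1 / (w * R * C + 1)             # low-pass transfer function

# np.log10 happily accepts arrays, so no TypeError here
db_hp = 20 * np.log10(np.abs(hp_tf))
db_lp = 20 * np.log10(np.abs(lp_tf))

plt.plot(f, db_hp)
plt.plot(f, db_lp, '-r')
plt.xscale('log')
plt.show()
As an aside, plotting the magnitude (np.abs) rather than the raw complex values is also what makes the two curves cross at the expected 0.707 instead of 0.5.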
I am fairly new to Python and trying to transfer some code from MATLAB to Python. I am trying to optimize a function in Python using fmin_bfgs. I always try to vectorize the code when possible, but I ran into the following problem that I can't figure out. Here is a test example.
from pylab import *
from scipy.optimize import fmin_bfgs
## Create some linear data
L=linspace(0,10,100).reshape(100,1)
n=L.shape[0]
M=2*L+5
L=hstack((ones((n,1)),L))
m=L.shape[0]
## Define sum of squared errors as non-vectorized and vectorized
def Cost(theta, X, Y):
    return 1.0/(2.0*m)*sum((theta[0] + theta[1]*X[:,1:2] - Y)**2)

def CostVec(theta, X, Y):
    err = X.dot(theta) - Y
    resid = err**2
    return 1.0/(2.0*m)*sum(resid)
## Initialize the theta
theta=array([[0.0], [0.0]])
## Run the minimization on the two functions
print fmin_bfgs(Cost, x0=theta,args=(L,M))
print fmin_bfgs(CostVec, x0=theta,args=(L,M))
The first answer, with the unvectorized function, gives the correct answer, which is just the vector [5, 2]. But the second answer, using the vectorized form of the cost function, returns roughly [15, 0]. I have figured out that the 15 doesn't appear from nowhere: it is two times the mean of the data plus the intercept, i.e., $2\times 5+5$. Any help is greatly appreciated.
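To illustrate where I suspect the difference creeps in (if I understand correctly, fmin_bfgs flattens x0 to a 1-D array before calling the cost function), the shapes do something surprising:
import numpy as np

X = np.hstack((np.ones((100, 1)), np.linspace(0, 10, 100).reshape(100, 1)))
Y = 2 * X[:, 1:2] + 5        # shape (100, 1)
theta = np.zeros(2)          # 1-D, the way fmin_bfgs passes it in

err = X.dot(theta) - Y       # (100,) minus (100, 1) broadcasts
print(err.shape)             # prints (100, 100), not (100, 1)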
I am interested in using Python to compute a confidence interval from a Student t distribution.
I am using the StudentTCI() function in Mathematica and now need to code the same function in Python: http://reference.wolfram.com/mathematica/HypothesisTesting/ref/StudentTCI.html
I am not quite sure how to build this function myself, but before I embark on that: does this function already exist in Python somewhere, like numpy? (I haven't used numpy, and my advisor advised against using numpy if possible.)
What would be the easiest way to solve this problem? Can I copy the source code for StudentTCI() (if it exists in numpy) into my code as a function definition?
Edit: I'm going to need to build the StudentTCI function using Python code (if possible). Installing scipy has turned into a dead end. I am having the same problem everyone else is having, and there is no way I can require scipy for the code I distribute if it takes this long to set up.
Does anyone know how to look at the source code for the algorithm in the scipy version? I'm thinking I'll refactor it into a Python definition.
I guess you could use scipy.stats.t and its interval method:
In [1]: from scipy.stats import t
In [2]: t.interval(0.95, 10, loc=1, scale=2) # 95% confidence interval
Out[2]: (-3.4562777039298762, 5.4562777039298762)
In [3]: t.interval(0.99, 10, loc=1, scale=2) # 99% confidence interval
Out[3]: (-5.338545334351676, 7.338545334351676)
Sure, you can make your own function if you like. Let's make it look like the one in Mathematica:
from scipy.stats import t

def StudentTCI(loc, scale, df, alpha=0.95):
    return t.interval(alpha, df, loc, scale)

print StudentTCI(1, 2, 10)
print StudentTCI(1, 2, 10, 0.99)
Result:
(-3.4562777039298762, 5.4562777039298762)
(-5.338545334351676, 7.338545334351676)
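If you are starting from raw data rather than a known loc and scale, the usual choice for a confidence interval of the mean is loc = sample mean and scale = standard error of the mean, e.g. (made-up sample):
import numpy as np
from scipy.stats import t

data = np.array([4.1, 5.2, 6.3, 5.8, 4.9])   # made-up sample
n = data.size
sem = data.std(ddof=1) / np.sqrt(n)          # standard error of the mean
print t.interval(0.95, n - 1, loc=data.mean(), scale=sem)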
I want to calculate logistic regression parameters using R's glm function. I'm working in Python and using rpy2 for that.
For some reason, when I run glm directly in R I get much faster results than through rpy2. Do you know why the calculations using rpy2 are so much slower?
I'm using R v2.13.1 and rpy2 v2.0.8.
Here is the code I'm using:
import numpy
from rpy2 import robjects as ro
import rpy2.rlike.container as rlc
def train(self, x_values, y_values, weights):
    x_float_vector = [ro.FloatVector(x) for x in numpy.array(x_values).transpose()]
    y_float_vector = ro.FloatVector(y_values)
    weights_float_vector = ro.FloatVector(weights)
    names = ['v' + str(i) for i in xrange(len(x_float_vector))]
    d = rlc.TaggedList(x_float_vector + [y_float_vector], names + ['y'])
    data = ro.RDataFrame(d)
    formula = 'y ~ '
    for x in names:
        formula += x + '+'
    formula = formula[:-1]
    fit_res = ro.r.glm(formula=ro.r(formula), data=data, weights=weights_float_vector,
                       family=ro.r('binomial(link="logit")'))
Without the full R code you are benchmarking against, it is difficult to precisely point out where the problem might be.
You might want to run this through a Python profiler to see where the bottleneck(s) is (are).
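For example, with the standard-library profiler (the trainer object and the data names below are placeholders for whatever you actually have):
import cProfile
import pstats

# profile a single call to the train method from the question
cProfile.runctx('trainer.train(x_values, y_values, weights)',
                globals(), locals(), 'glm_profile')
pstats.Stats('glm_profile').sort_stats('cumulative').print_stats(15)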
Finally, the current release of rpy2 is 2.2.6. Besides API changes, it runs faster and (presumably) has fewer bugs than 2.0.8.
Edit: from your comments, I now suspect that you are calling your function in a loop, and that a large fraction of the time is spent building R vectors (which might only need to be built once).
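A sketch of that idea, with made-up data and a hypothetical loop (same rpy2 2.0-era API as your snippet):
import numpy as np
from rpy2 import robjects as ro
import rpy2.rlike.container as rlc

# build the R data frame and family object once, outside the loop
x = np.linspace(0, 1, 100)
y = (x > 0.5).astype(float)
d = rlc.TaggedList([ro.FloatVector(x), ro.FloatVector(y)], ['v0', 'y'])
data = ro.RDataFrame(d)
family = ro.r('binomial(link="logit")')

# only the weight vector changes between calls
for weights in (np.ones(100), np.linspace(0.5, 1.5, 100)):   # made-up weights
    w = ro.FloatVector(weights)
    fit_res = ro.r.glm(formula=ro.r('y ~ v0'), data=data, weights=w, family=family)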