Getting several statistics from scipy.stats.binned_statistic

I'm using scipy.stats.binned_statistic to get some useful stats on each chunk of data.
However, this function returns only one statistic (mean, std, or a custom one), and I need two. So right now I'm calling it twice:
stat1, bin_edges1, binnumber1 = stats.binned_statistic(x, values, statistic=function1, bins=nbins)
stat2, bin_edges2, binnumber2 = stats.binned_statistic(x, values, statistic=function2, bins=nbins)
A custom function can only return a single numerical statistic, so I feel I'm doing twice the work, and there should be a cleverer way to get my two statistics. Any ideas?
Thanks!
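One possible approach, sketched here under assumptions rather than as an established recipe: binned_statistic already returns binnumber, the bin index of every sample, so a second statistic can be computed from that assignment without binning the data again. The arrays x and values, the bin count, and the choice of mean and standard deviation are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data standing in for the question's x and values.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=1000)
values = rng.normal(size=1000)
nbins = 5

# A single call does the binning; binnumber holds each sample's 1-based bin index.
mean_per_bin, bin_edges, binnumber = stats.binned_statistic(
    x, values, statistic="mean", bins=nbins
)

# Reuse the bin assignment for the second statistic instead of binning again.
std_per_bin = np.full(nbins, np.nan)
for i in range(nbins):
    in_bin = values[binnumber == i + 1]
    if in_bin.size:
        std_per_bin[i] = in_bin.std()
```

The loop still visits each bin once per extra statistic, but the search for the bin of every sample is done only once.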

Related

How to filter out useless data in a dataset using Python?

I have a dataset: temperature and pressure values in different ranges.
I want to filter out all data that deviates more than x% from the "normal" value. This data occurs on process failures.
Extra: the normal value can change over a longer time, so what is an exception at timestamp1 can be normal at timestamp2.
I looked into some noise filters, but I'm not sure this is noise.

You asked two questions.
1.
Tack on a derived column, so it's easy to filter.
For "x%", like five percent, you might use
avg = np.mean(df.pressure)
df['pres_deviation'] = abs(df.pressure - avg) / avg
print(df[df.pres_deviation < .05])
But rather than working with a percentage,
you might find it more natural to work with standard deviations,
filtering out e.g. values more than three standard deviations from the mean.
See
https://en.wikipedia.org/wiki/Standard_score
sklearn StandardScaler
2.
(Extra: the normal value can change over time.)
You could use a window of "most recent 100 samples" to define a smoothed average, store that as an extra column, and it replaces the avg scalar in the calculations above.
More generally you could manually set high / low thresholds as a time series in your data.
The area you're describing is called "change point detection", and we find an extensive literature on it, see e.g. https://paperswithcode.com/task/change-point-detection .
I have used ruptures to good effect, and I recommend it to you.
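A minimal sketch of the windowed variant described in point 2, assuming pandas is available. The synthetic pressure series, the 100-sample window, and the 5% threshold are illustrative choices, not values from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a slowly drifting pressure with two injected process failures.
n = 500
pressure = np.linspace(100.0, 120.0, n)
pressure[200] = 160.0
pressure[400] = 60.0
df = pd.DataFrame({"pressure": pressure})

window = 100      # "most recent 100 samples"
threshold = 0.05  # keep rows within 5% of the local normal value

# The rolling mean replaces the single avg scalar, so "normal" can drift over time.
df["normal"] = df["pressure"].rolling(window, min_periods=1).mean()
df["pres_deviation"] = (df["pressure"] - df["normal"]).abs() / df["normal"]
clean = df[df["pres_deviation"] < threshold]
```

Note that each outlier is included in its own window, so it pulls the local average slightly toward itself; a rolling median is one alternative if that matters.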

(scipy.stats.qmc) How to do multiple randomized Quasi Monte Carlo

I want to generate many randomized realizations of a low-discrepancy sequence with scipy.stats.qmc. I only know this way, which directly provides a randomized sequence:
from scipy.stats import qmc
ld = qmc.Sobol(d=2, scramble=True)
r = ld.random_base2(m=10)
But if I run
r = ld.random_base2(m=10)
twice I get
The balance properties of Sobol' points require n to be a power of 2. 2048 points have been previously generated, then: n=2048+2**10=3072. If you still want to do this, the function 'Sobol.random()' can be used.
From the docs, it seems that using Sobol.random() is discouraged.
What I would like (and it should be faster) is to first get
ld = qmc.Sobol(d=2, scramble=False)
and then generate something like 1000 scramblings (or other randomizations) from this initial series.
That avoids regenerating the Sobol sequence for each sample; only the scrambling is redone.
How can I do that?
It seems to me that this is the proper way to do many randomized QMC runs, but I might be wrong and there might be other ways.

As the warning suggests, Sobol' is a sequence, meaning that each new sample is linked to the previous ones. You have to respect the 2^m property. It's perfectly fine to use Sobol.random() if you understand how to use it; this is why we created Sobol.random_base2(), which prints a warning if you try to do something that would break the properties of the sequence. Remember that with Sobol' you cannot skip 10 points and then sample 5, or do arbitrary things like that. If you do, you will not get the convergence rate guaranteed by Sobol'.
In your case, what you want to do is reset the sequence between draws (Sobol.reset). A new draw will be different from the previous one if scramble=True. Another way (using a non-scrambled sequence, for instance) is to sample 2^k points and skip the first 2^(k-1); then you can sample 2^n points with n<k-1.
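As a hedged illustration of drawing several independent randomizations, the sketch below builds a fresh scrambled engine per replicate instead of relying on how reset interacts with scrambling. The dimension, m, the number of replicates, and the test integrand are arbitrary choices for the example.

```python
import numpy as np
from scipy.stats import qmc

d, m, n_replicates = 2, 10, 5

replicates = []
for _ in range(n_replicates):
    # A new engine draws new scrambling randomness; random_base2 is called
    # once per engine, so each replicate is a full set of 2**m points.
    engine = qmc.Sobol(d=d, scramble=True)
    replicates.append(engine.random_base2(m=m))

samples = np.stack(replicates)  # shape (n_replicates, 2**m, d)

# Example use: one estimate of E[u0 * u1] per replicate (exact value 0.25);
# the spread across replicates gives a rough error estimate.
estimates = samples.prod(axis=2).mean(axis=1)
```

No seed is passed here, so results change from run to run.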

Chi-Square test for groups of unequal size

I'd like to apply the chi-square test scipy.stats.chisquare, but the total number of observations differs between my groups.
import pandas as pd
data = {'expected': [20, 13, 18, 21, 21, 29, 45, 37, 35, 32, 53, 38, 25, 21, 50, 62],
        'observed': [19, 10, 15, 14, 15, 25, 25, 20, 26, 38, 50, 36, 30, 28, 59, 49]}
data = pd.DataFrame(data)
print(data.expected.sum())
print(data.observed.sum())
Ignoring this would be incorrect, right?
Does the default behavior of scipy.stats.chisquare take this into account? I checked with pen and paper and it looks like it doesn't. Is there a parameter for this?
from scipy.stats import chisquare
# incorrect since the number of observations is unequal
chisquare(f_obs=data.observed, f_exp=data.expected)
When I adjust manually, I get a slightly different result.
# adjust actual number of observations
data['obs_prop']=data['observed'].apply(lambda x: x/data['observed'].sum())
data['observed_new']=data['obs_prop']*data['expected'].sum()
# proper way
chisquare(f_obs=data.observed_new, f_exp=data.expected)
Please correct me if I am wrong at some point. Thanks.
ps: I tagged R for additional statistical expertise

Basically this was a different statistical problem - Chi-square test of independence of variables in a contingency table.
from scipy.stats import contingency as cont
chi2, p, dof, exp=cont.chi2_contingency(data)
p

I didn't quite understand the question. However, the way I see it, you can use scipy.stats.chi2_contingency if you want to run an independence test between two categorical variables.
scipy.stats.chisquare can also be used to compare observed vs. expected frequencies. Here the number of categories should be the same. Logically, a category would get a frequency of 0 if it has an observed frequency but no expected frequency, and vice versa.
Hope this helps
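The two framings mentioned in the answers can be put side by side with the question's numbers. This is a sketch, not a statement about which test fits the data: that depends on how the "expected" column was obtained. Rescaling the expected counts to the observed total is an assumption made here so that both totals agree, since recent SciPy releases reject inputs to chisquare whose totals differ.

```python
import numpy as np
from scipy.stats import chisquare, chi2_contingency

expected = np.array([20, 13, 18, 21, 21, 29, 45, 37, 35, 32, 53, 38, 25, 21, 50, 62])
observed = np.array([19, 10, 15, 14, 15, 25, 25, 20, 26, 38, 50, 36, 30, 28, 59, 49])

# Goodness of fit: rescale the expected counts so both totals agree.
expected_scaled = expected * observed.sum() / expected.sum()
gof = chisquare(f_obs=observed, f_exp=expected_scaled)

# Independence / homogeneity: treat the two columns as a 16 x 2 contingency table.
table = np.column_stack([expected, observed])
chi2, p, dof, exp_freq = chi2_contingency(table)

print(gof.statistic, gof.pvalue)
print(chi2, p, dof)
```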

Finding combination of columns which provides best combination based on function return

I have a dataframe with daily returns for 6 portfolios (PORT1, PORT2, PORT3, ... PORT6).
I have defined functions for compound annual returns and risk-adjusted returns. I can run this function for any one PORT.
I want to find a combination of portfolios (assume equal weighting) to obtain the highest returns. For example, a combination of PORT1, PORT3, PORT4, and PORT6 may provide the highest risk adjusted return. Is there a method to automatically run the defined function on all combinations and obtain the desired combination?
No code is included as I do not think it is necessary to show the computation used to determine risk adj return.
def returns(PORT):
    val = ... [computation of return here for PORT]
    return val

Finding the optimal location within a multidimensional space is possible, but people have made fortunes figuring out better ways of achieving exactly this.
The problem at the outset is setting out your possibility space. You have six dimensions, and presumably you want to allocate 1 unit of "stuff" across all six, such that a vector of allocations {a,b,c,d,e,f} sums to 1. That's still an infinity of possibilities, so maybe we start off with increments of size 0.10. Ten increments across 6 dimensions gives on the order of 10^6 grid points before the sum-to-1 constraint is applied, and far fewer once it is.
So the simple brute-force method would be to "simply" run your function across the entire parameter space, store the values and pick the best one.
That may not be the answer you want. Other methods exist, including randomising your guesses and limiting your results to a more manageable number. But the performance gain is offset by some uncertainty, and by some potentially difficult conversations with your clients ("What do you mean you did it randomly?!").
To make any guesses at what might be optimal, it would be helpful to have an understanding of the response curves each portfolio has under different circumstances and the sorts of risk/reward profiles you might expect them to operate under. Are they linear, quadratic, or are they more complex? If you can model them mathematically, you might be able to use an algorithm to reduce your search space.
Short (but fundamental) answer is "it depends".
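The brute-force idea above can be sketched as follows, assuming weights move in steps of 0.1 and must use up the whole unit. The score function is a made-up stand-in for a real risk-adjusted return, and weight_grid is a hypothetical helper name.

```python
from itertools import product

def weight_grid(n_assets, steps):
    """Yield weight tuples in increments of 1/steps whose steps add up to the whole unit."""
    for combo in product(range(steps + 1), repeat=n_assets - 1):
        remainder = steps - sum(combo)
        if remainder >= 0:
            # The last asset takes whatever is left over.
            yield tuple(k / steps for k in combo) + (remainder / steps,)

def score(weights):
    # Hypothetical objective: closeness to an arbitrary target allocation.
    target = (0.3, 0.0, 0.2, 0.2, 0.0, 0.3)
    return -sum((w - t) ** 2 for w, t in zip(weights, target))

candidates = list(weight_grid(n_assets=6, steps=10))
best = max(candidates, key=score)
print(len(candidates), best)
```

With the sum constraint applied, the grid for 6 assets and 10 steps is small enough to evaluate exhaustively.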

You can do
import itertools
best_return = float("-inf")
best_PORT = None
# sizes 1..len(PORTS): skips the empty combination and includes the full set
for r in range(1, len(PORTS) + 1):
    for PORT in itertools.combinations(PORTS, r):
        cur_return = returns(PORT)
        if cur_return > best_return:
            best_return = cur_return
            best_PORT = PORT
You can also do
max((PORT for r in range(1, len(PORTS) + 1)
     for PORT in itertools.combinations(PORTS, r)), key=returns)
However, this is more of an economics question than a CS one. Given a set of positions and their returns and risk, there are explicit formulae to find the optimal portfolio without having to brute force it.
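For a self-contained run of the combination search, the sketch below uses made-up daily returns and a made-up score (mean over standard deviation of the equal-weight blend). Both are placeholders for the asker's real data and returns function.

```python
import itertools
import statistics

# Hypothetical daily returns for each portfolio.
daily_returns = {
    "PORT1": [0.010, -0.004, 0.006, 0.002],
    "PORT2": [0.020, -0.030, 0.025, -0.010],
    "PORT3": [0.003, 0.002, 0.004, 0.003],
    "PORT4": [-0.002, 0.008, -0.001, 0.005],
}

def returns(ports):
    """Hypothetical score: mean / stdev of the equal-weight blend's daily returns."""
    blended = [statistics.mean(day) for day in zip(*(daily_returns[p] for p in ports))]
    return statistics.mean(blended) / statistics.stdev(blended)

names = list(daily_returns)
# Every non-empty subset of portfolios, from single ones up to the full set.
subsets = [c for r in range(1, len(names) + 1)
           for c in itertools.combinations(names, r)]
best_ports = max(subsets, key=returns)
print(best_ports, returns(best_ports))
```

The number of subsets doubles with each added portfolio, so this stays cheap for 6 portfolios but does not scale to large universes.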

Linear regression with Python

I am studying "Building Machine Learning Systems with Python" (2nd edition).
I have a silly doubt about the answer in the very first chapter.
According to the book and my own observations, I always get the 2nd-order polynomial as the best-fitting curve.
Whenever I train my system with the training dataset, I get a different test error for each polynomial function.
Thus the parameters of my equation also differ.
But surprisingly, I get approximately the same answer every time, in the range 9.19-9.99.
My final hypothesis function has different parameters each time, but I get approximately the same answer.
Can anyone tell me the reason behind it?
[FYI: I am finding the answer for y=100000]
I am sharing the code sample and the output of each iteration.
Here are the errors and the corresponding answers with it:
https://i.stack.imgur.com/alVzU.png
https://i.stack.imgur.com/JVGSm.png
https://i.stack.imgur.com/RB53X.png
Thanks in advance!
# Use NumPy directly rather than the deprecated aliases in the top-level scipy namespace
import numpy as np
import matplotlib.pyplot as mp
from scipy.optimize import fsolve

def error(f, x, y):
    # sum of squared residuals of model f on the points (x, y)
    return np.sum((f(x) - y) ** 2)

data = np.genfromtxt("web_traffic.tsv", delimiter="\t")
x = data[:, 0]
y = data[:, 1]
# drop rows where the hit count is missing
x = x[~np.isnan(y)]
y = y[~np.isnan(y)]
mp.scatter(x, y, s=10)
mp.title("web traffic over the month")
mp.xlabel("week")
mp.ylabel("hits/hour")
mp.xticks([w * 24 * 7 for w in range(10)], ["week %i" % i for i in range(10)])
mp.autoscale(enable=True, tight=True)
mp.grid(color='b', linestyle='-', linewidth=1)
mp.show()
# only fit the data after the inflection point at week 3.5
inflection = int(3.5 * 7 * 24)
xa = x[inflection:]
ya = y[inflection:]
f1 = np.poly1d(np.polyfit(xa, ya, 1))
f2 = np.poly1d(np.polyfit(xa, ya, 2))
f3 = np.poly1d(np.polyfit(xa, ya, 3))
print(error(f1, xa, ya))
print(error(f2, xa, ya))
print(error(f3, xa, ya))
fx = np.linspace(0, xa[-1], 1000)
mp.plot(fx, f1(fx), linewidth=1)
mp.plot(fx, f2(fx), linewidth=2)
mp.plot(fx, f3(fx), linewidth=3)
# random 30% / 70% test / train split
frac = 0.3
partition = int(frac * len(xa))
shuffled = np.random.permutation(len(xa))
test = sorted(shuffled[:partition])
train = sorted(shuffled[partition:])
fbt1 = np.poly1d(np.polyfit(xa[train], ya[train], 1))
fbt2 = np.poly1d(np.polyfit(xa[train], ya[train], 2))
fbt3 = np.poly1d(np.polyfit(xa[train], ya[train], 3))
fbt4 = np.poly1d(np.polyfit(xa[train], ya[train], 4))
print("error in fbt1:%f" % error(fbt1, xa[test], ya[test]))
print("error in fbt2:%f" % error(fbt2, xa[test], ya[test]))
print("error in fbt3:%f" % error(fbt3, xa[test], ya[test]))
print(fbt2)
print(fbt2 - 100000)
# fsolve returns an array; take its single element before formatting
maxreach = fsolve(fbt2 - 100000, x0=800)[0] / (7 * 24)
print("ans:%f" % maxreach)

Don't do it like that.
Linear regression is more "up to you" than you think.
Start by getting the slope of the line: (#1) average((f(x2)-f(x))/(x2-x)).
Then use that answer as M in (#2) average(f(x)-M*x).
Now you have (#1) and (#2) as your regression.
For any similar type of regression, e.g. polynomial, you need to subtract the A-factor (first factor) by using the n super-delta of f(x), each one with respect to delta(x). E.g. delta(ax^2+bx+c)/delta(x) gives you an equation with a and b, and from there it works. When doing this, take the average every time if there are more entries. Do it like a window on paper sliding down: you select entries 1-10, then 2-11, 3-12, etc. for some crazy awesome regression.
You may want to create a matrix API. The best way to handle it is to first create an API that takes a row and a column out. THEN you fool around with that to automate it. The ratios of the in-out entries left in only 2 columns are averaged, and that is the solution for the coefficient. Then make a program to take rows out but, for example, leave row 1 and row 5 (OUTPUT), then row 2 and row 5, ... row 4 and row 5.
I wouldn't recommend Python for coding this. I recommend C, because it prevents you from making dirty arrays that you don't remember. You need to understand systems theory. You must build system by system. It is insane to code matrices without building automated sub-systems that are carefully tested. I failed until I worked on it in C, so I first made a one-time shrinking function that is carefully tested, then built systems to automate getting 1 coefficient, tested that, then automated the repetition of that program to solve it.
You won't understand any of this by using Python or similar shortcuts. You use them after you realize what they really are. That's how I learned. I still wonder how I coded that; I am still amazed. The problem, though, is that it's unstable above 4x4 (actually 4x5) matrices.
Good luck,
Misha Taylor
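Read literally, steps (#1) and (#2) above can be sketched as below, assuming x2 means the next sample after x. The function name average_slope_fit is made up for this example. This is not ordinary least squares: on noisy data it will generally give a different line than a polyfit of degree 1, and it agrees exactly only when the points lie on a line.

```python
def average_slope_fit(xs, ys):
    """Fit y ~ m*x + b by averaging consecutive slopes, then averaging y - m*x."""
    # (#1) average of (f(x2) - f(x)) / (x2 - x) over consecutive samples
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in zip(zip(xs, ys), zip(xs[1:], ys[1:]))]
    m = sum(slopes) / len(slopes)
    # (#2) average of f(x) - M*x
    b = sum(y - m * x for x, y in zip(xs, ys)) / len(xs)
    return m, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # points lying exactly on y = 2x + 1
m, b = average_slope_fit(xs, ys)
print(m, b)
```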
