How to get scipy.stats.chisquare to function properly - python

I have two input files of identical size and shape, but the data they contain have different resolutions, and I want to perform a chi-squared test on them.
The input files are 500 lines long and contain 4 columns delimited by spaces. I am trying to test the second column of each input file against the other.
My code is as follows:
# Import statements
C = pl.loadtxt("input_1.txt")
D = pl.loadtxt("input_2.txt")
col2_C = C[:,1]
col2_D = D[:,1]
f_obs = np.array([col2_C])
f_exp = np.array([col2_D])
chisquare(f_obs, f_exp)
This gives me an error saying:
ValueError: df <= 0
I don't even understand what it is complaining about here.
I have tried several other syntaxes within the script, each of which also resulted in various errors:
This one was found here.
chisquare = f_obs=[col2_C], f_exp=[col2_D])
TypeError: chisquare() takes at least one positional argument
Then I tried
chisquare = f_obs(col2_C), F_exp=[col2_D)
NameError: name 'f_obs' is not defined
I also tried several other syntactical tweaks, but to no avail. If anybody could please help me get this running I would appreciate it greatly.
Thanks in advance.

First, be sure you are importing chisquare from scipy.stats. Numpy has the function numpy.random.chisquare, but that does not do a statistical test. It generates samples from a chi-square probability distribution.
So be sure you use:
from scipy.stats import chisquare
There is a second problem.
As slices of the two-dimensional array returned by loadtxt, col2_C and col2_D are one-dimensional numpy arrays, so there is no need to use, for example, np.array([col2_C]) when you pass these to chisquare. Just use col2_C and col2_D directly:
chisquare(col2_C, col2_D)
Wrapping the arrays with np.array like you did is causing the problem. chisquare accepts multidimensional arrays and an axis argument. When you do f_obs = np.array([col2_C]) (with the extra square brackets), f_obs is actually a two-dimensional array, with shape (1, 500). Similarly f_exp has shape (1, 500). The default axis argument of chisquare is 0. So when you called chisquare(f_obs, f_exp), you were asking chisquare to perform 500 chi-square tests, with each test having a single observed and expected value. A test with a single category has zero degrees of freedom, which is what the error df <= 0 is complaining about.
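To make the shape issue concrete, here is a minimal sketch. The data are made up for illustration (they are not the contents of input_1.txt or input_2.txt), and the expected values are chosen so that their total matches the observed total, which recent SciPy versions require.

```python
import numpy as np
from scipy.stats import chisquare

# Made-up stand-ins for the second columns of the two files.
rng = np.random.default_rng(0)
col2_C = rng.integers(20, 40, size=500).astype(float)
col2_D = np.full(500, col2_C.mean())  # same total as col2_C

# The extra square brackets add a leading axis of length 1.
wrapped = np.array([col2_C])
print(wrapped.shape)  # (1, 500)
print(col2_C.shape)   # (500,)

# Passing the 1-D arrays directly runs a single test over 500 categories.
stat, p = chisquare(col2_C, col2_D)
print(stat, p)
```

With the 1-D arrays, axis=0 runs along the 500 values, so one statistic and one p-value are returned.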

Related

Problems using numpy.piecewise

1. The core problem and question
I will provide an executable example below, but let me first walk you through the problem.
I am using solve_ivp from scipy.integrate to solve an initial value problem (see documentation). In fact I have to call the solver twice, to once integrate forward and once backward in time. (I would have to go unnecessarily deep into my concrete problem to explain why this is necessary, but please trust me here--it is!)
sol0 = solve_ivp(rhs,[0,-1e8],y0,rtol=10e-12,atol=10e-12,dense_output=True)
sol1 = solve_ivp(rhs,[0, 1e8],y0,rtol=10e-12,atol=10e-12,dense_output=True)
Here rhs is the right-hand side function of the initial value problem y'(t) = rhs(t,y). In my case, y has six components y[0] to y[5]. y0=y(0) is the initial condition. [0,±1e8] are the respective integration ranges, one forward and the other backward in time. rtol and atol are tolerances.
Importantly, you see that I flagged dense_output=True, which means that the solver does not only return the solutions on the numerical grids, but also as interpolation functions sol0.sol(t) and sol1.sol(t).
My main goal now is to define a piecewise function, say sol(t) which takes the value sol0.sol(t) for t<0 and the value sol1.sol(t) for t>=0. So the main question is: How do I do that?
I thought that numpy.piecewise should be the tool of choice for this. But I am having trouble using it, as you will see below, where I show what I have tried so far.
2. Example code
The code in the box below solves the initial value problem of my example. Most of the code is the definition of the rhs function, the details of which are not important to the question.
import numpy as np
from scipy.integrate import solve_ivp
# aux definitions and constants
sin=np.sin; cos=np.cos; tan=np.tan; sqrt=np.sqrt; pi=np.pi;
c = 299792458
Gm = 5.655090674872875e26
# define right hand side function of initial value problem, y'(t) = rhs(t,y)
def rhs(t,y):
    p,e,i,Om,om,f = y
    sinf=np.sin(f); cosf=np.cos(f); Q=sqrt(p/Gm); opecf=1+e*cosf;
    R = Gm**2/(c**2*p**3)*opecf**2*(3*(e**2 + 1) + 2*e*cosf - 4*e**2*cosf**2)
    S = Gm**2/(c**2*p**3)*4*opecf**3*e*sinf
    rhs = np.zeros(6)
    rhs[0] = 2*sqrt(p**3/Gm)/opecf*S
    rhs[1] = Q*(sinf*R + (2*cosf + e*(1 + cosf**2))/opecf*S)
    rhs[2] = 0
    rhs[3] = 0
    rhs[4] = Q/e*(-cosf*R + (2 + e*cosf)/opecf*sinf*S)
    rhs[5] = sqrt(Gm/p**3)*opecf**2 + Q/e*(cosf*R - (2 + e*cosf)/opecf*sinf*S)
    return rhs
# define initial values, y0
y0=[3.3578528933149297e13,0.8846,2.34921,3.98284,1.15715,0]
# integrate twice from t = 0, once backward in time (sol0) and once forward in time (sol1)
sol0 = solve_ivp(rhs,[0,-1e8],y0,rtol=10e-12,atol=10e-12,dense_output=True)
sol1 = solve_ivp(rhs,[0, 1e8],y0,rtol=10e-12,atol=10e-12,dense_output=True)
The solution functions can be addressed from here by sol0.sol and sol1.sol respectively. As an example, let's plot the 4th component:
from matplotlib import pyplot as plt
t0 = np.linspace(-1,0,500)*1e8
t1 = np.linspace( 0,1,500)*1e8
plt.plot(t0,sol0.sol(t0)[4])
plt.plot(t1,sol1.sol(t1)[4])
plt.title('plot 1')
plt.show()
3. Failing attempts to build piecewise function
3.1 Build vector valued piecewise function directly out of sol0.sol and sol1.sol
def sol(t): return np.piecewise(t,[t<0,t>=0],[sol0.sol,sol1.sol])
t = np.linspace(-1,1,1000)*1e8
print(sol(t))
This leads to the following error in piecewise in line 628 of .../numpy/lib/function_base.py:
TypeError: NumPy boolean array indexing assignment requires a 0 or 1-dimensional input, input has 2 dimensions
I am not sure, but I do think this is because of the following: In the documentation of piecewise it says about the third argument:
funclist : list of callables, f(x,*args,**kw), or scalars
[...]. It should take a 1d array as input and give an 1d array or a scalar value as output. [...].
I suppose the problem is that the solution in my case has six components. Hence, evaluated on a time grid, the output would be a 2d array. Can someone confirm that this is indeed the problem? If so, I think this limits the usefulness of piecewise by a lot.
3.2 Try the same, but just for one component (e.g. for the 4th)
def sol4(t): return np.piecewise(t,[t<0,t>=0],[sol0.sol(t)[4],sol1.sol(t)[4]])
t = np.linspace(-1,1,1000)*1e8
print(sol4(t))
This results in this error in line 624 of the same file as above:
ValueError: NumPy boolean array indexing assignment cannot assign 1000 input values to the 500 output values where the mask is true
Contrary to the previous error, unfortunately here I have so far no idea why it is not working.
3.3 Similar attempt, however first defining functions for the 4th components
def sol40(t): return sol0.sol(t)[4]
def sol41(t): return sol1.sol(t)[4]
def sol4(t): return np.piecewise(t,[t<0,t>=0],[sol40,sol41])
t = np.linspace(-1,1,1000)
plt.plot(t,sol4(t))
plt.title('plot 2')
plt.show()
Now this does not result in an error, and I can produce a plot, however this plot doesn't look like it should. It should look like plot 1 above. Also here, I so far have no clue what is going on.
Am thankful for help!
You can take a look at the numpy.piecewise source code. There is nothing special in this function, so I suggest doing everything manually.
def sol(t):
    ans = np.empty((6, len(t)))
    ans[:, t<0] = sol0.sol(t[t<0])
    ans[:, t>=0] = sol1.sol(t[t>=0])
    return ans
Regarding your failed attempts: yes, piecewise expects each function to return a 1d array. Your second attempt failed because, according to the documentation, the funclist argument should be a list of functions or scalars, but you passed a list of arrays. Contrary to the documentation, it does work with arrays, provided each array has as many elements as there are True entries in the corresponding condition (t < 0 or t >= 0), like:
def sol4(t): return np.piecewise(t,[t<0,t>=0],[sol0.sol(t[t<0])[4],sol1.sol(t[t>=0])[4]])
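The masking approach can be checked on a much smaller problem. The sketch below is an illustration only: it swaps in a two-component harmonic oscillator (a hypothetical stand-in for the six-component rhs of the question, chosen because its exact solution is known) and builds the combined function both by hand and, for a single component, with np.piecewise and callables.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Stand-in system (not the one from the question): y0' = y1, y1' = -y0,
# with exact solution y0(t) = cos(t), y1(t) = -sin(t) for y(0) = [1, 0].
def rhs(t, y):
    return [y[1], -y[0]]

y0 = [1.0, 0.0]
sol0 = solve_ivp(rhs, [0, -5], y0, rtol=1e-9, atol=1e-12, dense_output=True)
sol1 = solve_ivp(rhs, [0, 5], y0, rtol=1e-9, atol=1e-12, dense_output=True)

def sol(t):
    # Evaluate each dense solution only on its own part of the time axis.
    ans = np.empty((len(y0), len(t)))
    ans[:, t < 0] = sol0.sol(t[t < 0])
    ans[:, t >= 0] = sol1.sol(t[t >= 0])
    return ans

def sol_first(t):
    # np.piecewise hands each callable only the masked part of t,
    # and each callable returns a 1-D array of the same length.
    return np.piecewise(
        t, [t < 0, t >= 0],
        [lambda s: sol0.sol(s)[0], lambda s: sol1.sol(s)[0]],
    )

t = np.linspace(-5, 5, 1001)
full = sol(t)
first = sol_first(t)
print(np.max(np.abs(full[0] - np.cos(t))))
```

Both routes evaluate sol0.sol and sol1.sol only where their condition holds, which is why the sizes match on assignment.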

Why is binned_statistic_2d now throwing TypeError?

I have been using scipy's binned_statistic_2d function to plot a 2d histogram of some data, particularly to return a list of the index of which bin the data is in, by setting the expand_binnumbers = True. It was working perfectly, until today. The following code demonstrates my problem:
import numpy as np
from scipy.stats import binned_statistic_2d as hist
# my data is two arrays of numbers
x = np.random.random((5,))
y = np.random.random((5,))
# I need to know which bin the values are in so I return the bin_idx
data = hist(x,y, bins = [2,2], statistic = 'count', values = None, expand_binnumbers = True)
bin_idx = data[3]
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Any ideas why this should suddenly stop working?
A recent update to SciPy broke things somewhat. As @WarrenWeckesser said in the comments, setting values = x makes things work again.
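As a sketch of that workaround, the call from the question can be written with values=x. With statistic='count' the numbers in values do not affect the counts, so x is only a placeholder here; the random data are made up as in the question.

```python
import numpy as np
from scipy.stats import binned_statistic_2d

rng = np.random.default_rng(0)
x = rng.random(5)
y = rng.random(5)

# Pass an array for values instead of None; 'count' ignores its contents.
data = binned_statistic_2d(
    x, y, values=x, statistic='count', bins=[2, 2], expand_binnumbers=True
)
bin_idx = data[3]  # same as data.binnumber
print(bin_idx.shape)  # (2, 5)
```

With expand_binnumbers=True, bin_idx has one row per dimension and one column per data point.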

Pycopula library input

I am having an issue with the pycopula library.
The example (provided on https://github.com/blent-ai/pycopula) imports a csv dataset and then uses it in the function. I have generated two random variables, uniformly distributed, and combined them into a pd.DataFrame(). I then tried to estimate a Clayton copula.
import numpy as np
import pandas as pd
from pycopula.copula import ArchimedeanCopula
x1 = np.random.uniform(size=3000)
x2 = np.random.uniform(size=3000)
X = pd.DataFrame(); X[0]=x1; X[1]=x2
archimedean = ArchimedeanCopula(family="clayton", dim=2)
archimedean.fit(X, method="cmle")
I am getting a TypeError: '(0, slice(None, None, None))' is an invalid key. If anyone has used this library before and knows what input does the function take, I would be grateful. The full documentation link that it is provided on GitHub redirects me to a non-existing website (Error 404). Thanks!
I think the fit() method takes its data as a numpy array, so you can't pass a DataFrame to it.
X : numpy array (of size n * copula dimension)
Use DataFrame.to_numpy() to convert the data to the right type. Hope it works.
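A minimal sketch of the conversion, using only numpy and pandas. The key in the error message, (0, slice(None, None, None)), is what an index expression like X[0, :] produces; that works on a numpy array but not on a DataFrame. The final fit call is left commented out and is an assumption based on the snippet in the question, not something verified against pycopula.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame({0: rng.uniform(size=3000), 1: rng.uniform(size=3000)})

# Convert to a plain (n, dim) numpy array.
X_np = X.to_numpy()
print(type(X_np), X_np.shape)

# Tuple indexing such as [0, :] is valid on the array.
first_row = X_np[0, :]

# Assumed usage, following the question's code:
# archimedean.fit(X_np, method="cmle")
```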

Error in acorr_ljungbox from statsmodel

So I am trying to do a Ljung-Box test on a residual, but I am getting a strange error and cannot figure out why.
x = diag.acorr_ljungbox(np.random.random(20))
I tried doing the same with a random array also, still the same error:
ValueError: operands could not be broadcast together with shapes (19,) (40,)
This looks like a bug in the default lag setting, which is set to 40 independent of the length of the data.
As a workaround, and to get a proper statistic, the lags need to be restricted, e.g. using 5 lags below.
>>> from statsmodels.stats import diagnostic as diag
>>> diag.acorr_ljungbox(np.random.random(50))[0].shape
(40,)
>>> diag.acorr_ljungbox(np.random.random(20), lags=5)
(array([ 0.36718151, 1.02009595, 1.23734092, 3.75338034, 4.35387236]),
array([ 0.54454461, 0.60046677, 0.74406305, 0.44040973, 0.49966951]))

Indexing a 2D numpy array inside a function using a function parameter

Say I have a 2D image in python stored in a numpy array and called npimage (1024 times 1024 pixels).
I would like to define a function ShowImage that takes as a parameter a slice of the numpy array:
def ShowImage(npimage,SliceNumpy):
    imshow(npimage(SliceNumpy))
such that it can plot a given part of the image, let's say:
ShowImage(npimage,[200:800,300:500])
would plot the image for lines between 200 and 800 and columns between 300 and 500, i.e.
imshow(npimage[200:800,300:500])
Is it possible to do that in python? For the moment, passing something like [200:800,300:500] as a parameter to a function results in an error.
Thanks for any help or link.
Greg
It's not possible, because [...] with colons is a syntax error when not used directly to index an object, but you could do one of the following:
Pass only the relevant sliced image, without a separate argument: ShowImage(npimage[200:800,300:500]) (no comma).
Or pass a tuple of slices as the argument: ShowImage(npimage,(slice(200,800),slice(300,500))). Those can be used for slicing inside the function because they are just another way of defining this slice:
npimage[(slice(200,800),slice(300, 500))] == npimage[200:800,300:500]
A possible solution for the second option could be:
import matplotlib.pyplot as plt
def ShowImage(npimage, SliceNumpy):
    plt.imshow(npimage[SliceNumpy])
    plt.show()
ShowImage(npimage, (slice(200,800),slice(300, 500)))
# plots the relevant slice of the array.
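The equivalence between a tuple of slice objects and the bracket syntax can be checked without plotting. In the sketch below, crop is a hypothetical helper standing in for ShowImage, and the image is random data. np.s_ is not part of the answer above; it is a NumPy helper that builds the same tuple from the familiar colon syntax.

```python
import numpy as np

rng = np.random.default_rng(0)
npimage = rng.random((1024, 1024))  # made-up image data

def crop(npimage, key):
    # Hypothetical helper: index the array with whatever key is passed in.
    return npimage[key]

key = (slice(200, 800), slice(300, 500))
part = crop(npimage, key)
print(part.shape)  # (600, 200)

# np.s_ builds the same tuple of slices from the colon syntax.
same_key = np.s_[200:800, 300:500]
print(same_key == key)  # True
```

Either key can be passed as the second argument of a function like ShowImage.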
