I want to use the function InterX to find the intersection of two curves. However, the function does not return the expected result. The function is available here.
The function always returns the point of intersection as P = None, None, when a valid point was expected.
import numpy as np
import pandas as pd
from InterX import InterX
x_t = np.linspace(0, 10, 10, True)
z_t = np.array((0, 0, 0, 0, 0, 0, 0.055, 0.41, 1.23, 4))
X_P = np.array((2,4))
Z_P = np.array((3,-1))
Line = pd.DataFrame(np.array((X_P,Z_P)))
Curve = pd.DataFrame(np.array([x_t,z_t]))
Curve = Curve.T
P = InterX(Line[0],Line[1],Curve[0],Curve[1])
In this script the expected result is P = [3.5, 0]. However, the returned point is P = [None, None].
The short answer - use:
P = InterX(L1, L1, L2, L2)
or
P = InterX(L1.iloc[:,0].to_frame(),L1.iloc[:,1].to_frame(),L2.iloc[:,0].to_frame(),L2.iloc[:,1].to_frame())
For a detailed answer, see the following, which refers to the code of the original question:
You need to pass two DataFrames with x and y values each (it would of course be much more logical if InterX accepted 4 Series or 2 DataFrames, respectively).
InterX then extracts the x and y values from these DataFrames in a very convoluted way in lines 90 through 119 (which could be done far more easily). So the working solution is:
import numpy as np
import pandas as pd
from InterX import InterX
x_t = np.linspace(0, 10, 10, True)
z_t = np.array((0, 0, 0, 0, 0, 0, 0.055, 0.41, 1.23, 4))
x_P = np.array((2,4))
z_P = np.array((3,-1))
curve_x = pd.DataFrame(x_t)
curve_z = pd.DataFrame(z_t)
line_x = pd.DataFrame(x_P)
line_z = pd.DataFrame(z_P)
p = InterX(line_x, line_z, curve_x, curve_z)
Output of print(p):
xs ys
0 3.5 0.0
Please note that according to the Python naming conventions (PEP 8), function and variable names should be lowercase, with words separated by underscores.
I find the code of InterX very difficult to understand; a much cleaner solution (along with a nice plot) is this one.
With
import numpy as np
import matplotlib.pyplot as plt
# `intersection` is the function from the solution linked above; it is also
# available on PyPI (pip install intersect, then `from intersect import intersection`).
from intersect import intersection
x_t = np.linspace(0, 10, 10, True)
z_t = np.array((0, 0, 0, 0, 0, 0, 0.055, 0.41, 1.23, 4))
X_P = np.array((2,4))
Z_P = np.array((3,-1))
x,y = intersection(x_t,z_t,X_P,Z_P)
print(x,y)
plt.plot(x_t,z_t,c='r')
plt.plot(X_P,Z_P,c='g')
plt.plot(x,y,'*k')
plt.show()
we get [3.5] [-0.] and this picture:
I want to use the values in each row of the df as parameters to simulate data N times and store the sum of the simulated data somewhere.
The reproducible example works, but as N gets larger, in combination with more rows in df, it becomes very time consuming.
Is there a way to optimize this? I have been looking into vectorization, using multiple generators to pipeline the operations, and list comprehensions, but I am stuck.
import numpy as np
import pandas as pd
from pert import PERT
data = [[0.1, 0.14, 0.25, 50, 100, 150], [0.01, 0.03, 0.1, 200, 250, 300]]
df = pd.DataFrame(data, columns = ["A", "B", "C", "D", "E", "F"])
N = 1000
confidence = 4
empty_list = []
for _ in range(N):
    df["P"] = df.apply(lambda x: PERT(x.A, x.B, x.C, confidence).rvs(1), axis = 1).astype(float)
    df["O"] = df.apply(lambda x: np.random.binomial(1, x.P, 1), axis = 1).astype(int)
    df["L"] = df.apply(lambda x: PERT(x.D, x.E, x.F, confidence).rvs(1) if x.O == 1 else 0, axis = 1).astype(int)
    empty_list.append(sum(df["L"]))
df1 = pd.DataFrame(empty_list, columns = ["L"])
P50 = np.percentile(df1["L"], 50).astype(int)
My first answer used a general approach to vectorization, not knowing what PERT really was or where it came from. Assuming that PERT is from pertdist, there is a much shorter and easier way to vectorize the code. The other answer is still useful for people who come here with a slightly different problem.
PERT directly allows drawing more than one sample at a time.
That makes things a lot easier.
import numpy as np
import pandas as pd
from pert import PERT # pertdist package
data = [[0.1, 0.14, 0.25, 50, 100, 150], [0.01, 0.03, 0.1, 200, 250, 300]]
df = pd.DataFrame(data, columns = ["A", "B", "C", "D", "E", "F"])
N = 1000
confidence = 4
P = PERT(df["A"], df["B"], df["C"], confidence).rvs((N, len(df)))
O = np.random.binomial(np.ones(len(df), dtype=int), P)
L = PERT(df["D"], df["E"], df["F"], confidence).rvs((N, len(df)))
L = np.where(O == 1, L, 0).astype(int)
data1 = np.sum(L, axis=1)
P50 = np.percentile(data1, 50).astype(int)
This answer doesn't make use of the capability of PERT (from pertdist) to draw more than one sample at a time.
Using that capability reduces the complexity of the vectorization a lot; this is outlined in my second answer.
This answer is still useful for people who come here with different problems that they want to vectorize, as it uses a more general approach.
I think you may get a nice speed-up from vectorization.
When vectorizing, you would:
1. replace apply with computing "P", "O" and "L" each in a single go for the entire data frame, and
2. avoid the for loop, for example by using a 3D numpy array.
For step 1 you would do
import numpy as np
import pandas as pd
from pert import PERT # pertdist package
data = [
    [0.1, 0.14, 0.25, 50, 100, 150],
    [0.01, 0.03, 0.1, 200, 250, 300],
]
confidence = 4
df = pd.DataFrame(data, columns = ["A", "B", "C", "D", "E", "F"])
# Just pass None for size to get a vectorized version of PERT.
# I think this is undocumented behaviour of scipy.stats.beta,
# which PERT.rvs uses under the hood
df["P"] = PERT(df["A"], df["B"], df["C"], confidence).rvs(None)
df["O"] = np.random.binomial(np.ones(len(df), dtype=int), df["P"])
L = PERT(df["D"], df["E"], df["F"], confidence).rvs(None)
df["L"] = np.where(df["O"] == 1, L, 0).astype(int) # replaces ... if ... else ...
sum(df["L"])
Now I have computed one result, but we need N. Doing this in one go is a bit trickier. This is step 2. One idea is not to use a 2D data frame, but a 3D numpy array (a 3D tensor).
Luckily PERT also supports this.
import numpy as np
from pert import PERT # pertdist package
data = np.array([
[0.1, 0.14, 0.25, 50, 100, 150],
[0.01, 0.03, 0.1, 200, 250, 300],
])
N = 1000
confidence = 4
data_N = np.array([  # there is probably a faster way to do this
    data
    for _ in range(N)
])
print(data_N.shape) # 1000, 2, 6 => 3D array/tensor
A = data_N[:, :, 0]
B = data_N[:, :, 1]
C = data_N[:, :, 2]
D = data_N[:, :, 3]
E = data_N[:, :, 4]
F = data_N[:, :, 5]
P = PERT(A, B, C, confidence).rvs(None)
print(P.shape) # 1000, 2 => P values for each row of the original df, 1000 times
# independently drawing samples for each value in P
O = np.random.binomial(np.ones_like(P, dtype=int), P)
L = PERT(D, E, F, confidence).rvs(None)
L = np.where(O == 1, L, 0).astype(int) # replaces ... if ... else ...
data1 = np.sum(L, axis=1) # sum over L for each of the N samples
print(data1.shape) # 1000, => 1000 results
for percent in [10, 25, 50, 75, 90]:
    print(np.percentile(data1, percent))
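As an aside, the replication step marked "there is probably a faster way to do this" can indeed be done without a Python loop. Here is a small sketch (my own addition, not part of the original answer) using np.tile:
import numpy as np
data = np.array([
    [0.1, 0.14, 0.25, 50, 100, 150],
    [0.01, 0.03, 0.1, 200, 250, 300],
])
N = 1000
# np.tile repeats the (2, 6) array N times along a new leading axis,
# producing the same (N, 2, 6) tensor as the list comprehension above.
data_N = np.tile(data, (N, 1, 1))
print(data_N.shape)  # (1000, 2, 6)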
So I have 2 surfaces (PolyData in PyVista), one on top of the other:
They are shaped a little differently along the Z axis, yet wherever the top one has a Z value at an X, Y position, we can be sure the bottom one has one as well. So how can one merge two surfaces, aligned in X, Y, into one solid mesh?
What I tried:
import numpy as np
import pyvista as pv
import vtk
def extruder(mesh, val_z):
    extrude = vtk.vtkLinearExtrusionFilter()
    extrude.SetInputData(mesh)
    extrude.SetVector(0, 0, val_z)
    extrude.Update()
    extruded_mesh = pv.wrap(extrude.GetOutput())
    return extruded_mesh
# generate two sheets of input data
noise = pv.perlin_noise(2, (0.2, 0.2, 0.2), (0, 0, 0))
bounds_2d = (-10, 10, -10, 10)
dim = (40, 50, 1)
bottom, top = [
    pv.sample_function(noise, dim=dim, bounds=bounds_2d + (z, z)).warp_by_scalar()
    for z in [-5, 5]
]
bottom = bottom.extract_surface(nonlinear_subdivision=5)
top = top.extract_surface(nonlinear_subdivision=5)
top = extruder(top, -50).triangulate()
bottom = extruder(bottom, 50).triangulate()
intersection = bottom.boolean_cut(top)
#top = top.clip_surface(bottom, invert=False, compute_distance=True)
#top = top.extrude([0, 0, -50]).triangulate()
#bottom = bottom.extrude([0, 0, 50]).triangulate()
#intersection = bottom.boolean_cut(top).triangulate()
p = pv.Plotter()
p.add_mesh(top, cmap="hot", opacity=0.15)
p.add_mesh(bottom, cmap="RdYlBu", opacity=0.15)
p.add_mesh(intersection, cmap="Dark2", opacity=1)
p.show()
What I get:
What I expected:
only the middle to be filled.
So I had to do this:
import numpy as np
import pyvista as pv
# generate two sheets of input data
noise = pv.perlin_noise(2, (0.2, 0.2, 0.2), (0, 0, 0))
bounds_2d = (-10, 10, -10, 10)
dim = (40, 50, 1)
bottom, top = [
    pv.sample_function(noise, dim=dim, bounds=bounds_2d + (z, z)).warp_by_scalar()
    for z in [-5, 5]
]
bottom = bottom.extract_surface()
top = top.extract_surface()
topm = top.extrude([0, 0, -50]).triangulate().clean()
bottomm = bottom.extrude([0, 0, 50]).triangulate().clean()
topm = topm.clip_surface(bottom, invert=False)
bottomm = bottomm.clip_surface(top, invert=True)
intersection = topm.boolean_add(bottomm).triangulate().clean().subdivide(2).clean()
p = pv.Plotter()
#p.add_mesh(topm, cmap="hot", opacity=0.15)
#p.add_mesh(bottomm, cmap="gnuplot2", opacity=0.15)
p.add_mesh(intersection, cmap="Dark2", opacity=1)
p.show()
The resulting mesh is really bad, yet it has the desired shape and is computed in usable time:
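If the mesh quality matters, one thing that might help is smoothing the combined surface. The following is only a rough sketch (my own addition, untested on this data) using PyVista's built-in Laplacian smoothing filter, with arbitrarily chosen parameters:
# `intersection` is the combined mesh produced above; n_iter and
# relaxation_factor are arbitrary illustration values.
smoothed = intersection.smooth(n_iter=100, relaxation_factor=0.1)
p = pv.Plotter()
p.add_mesh(smoothed, cmap="Dark2", opacity=1)
p.show()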
I have to plot many plots in the same graph. The x values are the same array for all of them: an array from 0 to N. The y values for each plot are arrays that start with 0 and begin having positive values at a different x, depending on the plot.
EXAMPLE:
x = np.arange(100)
y1 = [0, 0, 10, 12 , 53, ... , n]
y2 = [0, 0, 0, 12 , 53, ... , n]
y3 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 40, 67, 53, ... , n]
When I plot, there is a vertical line that goes from the bottom to the first positive value of y. In the case of y1, there is a line from (1, 0) to (2, 10); that is the line I want to avoid, I just want to plot from (2, 10).
I know I can create new arrays for x and y to match the conditions I want, but I really need to know if there is another way (a minimal sketch of one such workaround is shown after the code below).
There is an image with one example of my current plot.
Link of image
CODE:
import pandas as pd
import numpy as np
import xlrd
import matplotlib.pyplot as plt
# This is an Excel file where a user types a number; this number will be the number of months.
workbook = xlrd.open_workbook('INPUT.xlsx')
sheet1 = workbook.sheet_by_name('ASSUMPTIONS')
Num_Meses = np.array([i for i in range(int(sheet1.cell(5, 5).value) + 1)])
# Then I create a dictionary from which I take the arrays; (YPP, Y1P, Y2P) are of type 'numpy.ndarray'
filt = df['WELL TYPE'] == 'PP'
YPP = df.loc[filt, 'OIL PRODUCTION'][0]
filt = df['WELL TYPE'] == '1P'
Y1P = df.loc[filt, 'OIL PRODUCTION'][0] + YPP
filt = df['WELL TYPE'] == '2P'
Y2P = df.loc[filt, 'OIL PRODUCTION'][0] + Y1P
filt = df['WELL TYPE'] == '3P'
Y3P = df.loc[filt, 'OIL PRODUCTION'][0] + Y2P
plt.plot(Num_Meses, Y3P, label='3P')
plt.plot(Num_Meses, Y2P, label='2P')
plt.plot(Num_Meses, Y1P, label='1P')
plt.plot(Num_Meses, YPP, label='PP', color='k')
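For reference, a minimal sketch of one such workaround (not taken from an existing answer): replacing the leading zeros with NaN makes matplotlib skip those points, so each curve only appears from its first positive value.
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(10)
y1 = np.array([0, 0, 10, 12, 53, 60, 65, 70, 80, 90])  # made-up example data
# matplotlib does not draw segments through NaN values,
# so masking the zeros removes the initial vertical ramp.
y1_masked = np.where(y1 > 0, y1, np.nan)
plt.plot(x, y1_masked, label='y1')
plt.legend()
plt.show()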
A test code for this type of data:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x = np.linspace(0,1,20)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0])
n = np.size(x)
mean = sum(x*y)/n
sigma = np.sqrt(sum(y*(x-mean)**2)/n)
def gaus(x,a,x0,sigma):
    return a*np.exp(-(x-x0)**2/(2*sigma**2))
popt,pcov = curve_fit(gaus,x,y,p0=[max(y),mean,sigma])
plt.plot(x,y,'b+:',label='data')
plt.plot(x,gaus(x,*popt),'ro:',label='fit')
plt.legend()
I need to fit lots of data, which is just like the y array given above, to a Gaussian distribution.
Using the standard Gaussian fitting routine with scipy.optimize gives this kind of fit:
I have tried many different initial values, but cannot get any kind of fit.
Does anyone have any ideas how I could get this data fitted to a Gaussian?
Thanks
The problem
Your fundamental problem is that you have a severely underdetermined fitting problem. Think about it like this: you have three unknowns but only one datapoint. This is akin to solving for x, y, z when you only have one equation. Because the height of your Gaussian can vary independently of its width, there are infinitely many distributions, all with different widths, that will satisfy the constraints of your fit.
More directly, your a and sigma parameters can both change the maximum height of the distribution, which is pretty much the only thing that matters in terms of achieving a good fit (at least once the distribution is centered and fairly narrow). Thus, the fitting routines in SciPy can't figure out which of the two to change at any given step.
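To make that degeneracy concrete, here is a small numerical sketch (my own addition) showing that, once the peak is centered on the single non-zero datapoint, very different sigma values produce numerically identical residuals, so the optimizer has nothing to guide it:
import numpy as np
x = np.linspace(0, 1, 20)
y = np.zeros(20)
y[10] = 10
x0 = x[10]  # centre the peak on the non-zero datapoint
def gaus(x, a, x0, sigma):
    return a*np.exp(-(x-x0)**2/(2*sigma**2))
# Two quite different widths give essentially the same sum of squared residuals.
for sigma in (0.002, 0.008):
    residual = np.sum((gaus(x, 10, x0, sigma) - y)**2)
    print(sigma, residual)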
The fix
The simplest way to solve the problem is to lock down one of your parameters. You don't need to change your equation, but you do need to make at least one of a, x0, or sigma a constant. The best parameter to fix is probably x0, since it's trivial to determine the mean/median/mode of your data by just getting the x coordinate of the one datapoint that is non-zero in y. You'll also need to get a little more clever about how you set your initial guesses. Here's what that looks like:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x = np.linspace(0,1,20)
xdiff = x[1] - x[0]
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0])
# the mean/median/mode all occur at the x coordinate of the one datapoint that is non-zero in y
mean = x[np.argmax(y)]
# sigma should be tiny, since we want a narrow distribution
sigma = xdiff
# the scaling factor should be roughly equal to the "height" of the one datapoint
a = y.max()
def gaus(x,a,sigma):
    return a*np.exp(-(x-mean)**2/(2*sigma**2))
bounds = ((1, .015), (20, 1))
popt,pcov = curve_fit(gaus, x, y, p0=[a, sigma], maxfev=20000, bounds=bounds)
residual = ((gaus(x,*popt) - y)**2).sum()
plt.figure(figsize=(8,6))
plt.plot(x,y,'b+:',label='data')
xdist = np.linspace(x.min(), x.max(), 1000)
plt.plot(xdist,gaus(xdist,*popt),'C0', label='fit distribution')
plt.plot(x,gaus(x,*popt),'ro:',label='fit')
plt.text(.1,6,"residual: %.6e" % residual)
plt.legend()
plt.show()
Output:
The better fix
You don't need a fit to get the kind of Gaussians you want. You can instead use a simple closed form expression to calculate the parameters that you need, as in the fitonegauss function in the code below:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def gauss(x, a, mean, sigma):
    return a*np.exp(-(x - mean)**2/(2*sigma**2))
def fitonegauss(x, y, fwhm=None):
    if fwhm is None:
        # determine full width at half maximum from the spacing between the x points
        fwhm = (x[1] - x[0])
    # the mean/median/mode all occur at the x coordinate of the one datapoint that is non-zero in y
    mean = x[np.argmax(y)]
    # solve for sigma in terms of the desired full width at half maximum
    sigma = fwhm/(2*np.sqrt(2*np.log(2)))
    # max(pdf) == 1/(np.sqrt(2*np.pi)*sigma). Use that to determine a
    a = y.max() #(np.sqrt(2*np.pi)*sigma)
    return a, mean, sigma
N = 20
x = np.linspace(0,1,N)
y = np.zeros(N)
y[N//2] = 10
popt = fitonegauss(x, y)
plt.figure(figsize=(8,6))
plt.plot(x,y,'b+:',label='data')
xdist = np.linspace(x.min(), x.max(), 1000)
plt.plot(xdist,gauss(xdist,*popt),'C0', label='fit distribution')
residual = ((gauss(x,*popt) - y)**2).sum()
plt.plot(x, gauss(x,*popt),'ro:',label='fit')
plt.text(.1,6,"residual: %.6e" % residual)
plt.legend()
plt.show()
Output:
The advantages of this approach are many. It's far more computationally efficient than any fit could be, it will (for the most part) never fail, and it gives you far more control over the actual width of the distribution that you end up with.
The fitonegauss function is set up so that you can directly set the full width at half maximum of the fitted distribution. If you leave it unset, the code will automatically guess it from the spacing of the x data. This seems to produce reasonable results for your application.
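For example, a usage sketch (the width value is chosen arbitrarily for illustration) where the full width at half maximum is set explicitly instead of being guessed from the x spacing:
# Force the fitted peak to have a full width at half maximum of 0.1.
popt = fitonegauss(x, y, fwhm=0.1)
print(popt)  # (a, mean, sigma)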
Don't use a general "a" parameter; use the proper normal distribution equation instead:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x = np.linspace(0,1,20)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0])
n = np.size(x)
mean = sum(x*y)/n
sigma = np.sqrt(sum(y*(x-mean)**2)/n)
def gaus(x, x0, sigma):
    return 1/np.sqrt(2 * np.pi * sigma**2)*np.exp(-(x-x0)**2/(2*sigma**2))
popt,pcov = curve_fit(gaus,x,y,p0=[mean,sigma])
plt.plot(x,y,'b+:',label='data')
plt.plot(x,gaus(x,*popt),'ro:',label='fit')
plt.legend()
Is there any fast way to merge two numpy histograms with different bin ranges and bin numbers?
For example:
x = [1,2,2,3]
y = [4,5,5,6]
a = np.histogram(x, bins=10)
# a[0] = [1, 0, 0, 0, 0, 2, 0, 0, 0, 1]
# a[1] = [ 1. , 1.2, 1.4, 1.6, 1.8, 2. , 2.2, 2.4, 2.6, 2.8, 3. ]
b = np.histogram(y, bins=5)
# b[0] = [1, 0, 2, 0, 1]
# b[1] = [ 4. , 4.4, 4.8, 5.2, 5.6, 6. ]
Now I want to have some function like this:
def merge(a, b):
    # some actions here #
    return merged_a_b_values, merged_a_b_bins
Actually, I do not have x and y; only a and b are known.
But the result of merge(a, b) must be equal to np.histogram(x + y, bins=10):
m = merge(a, b)
# m[0] = [1, 0, 2, 0, 1, 0, 1, 0, 2, 1]
# m[1] = [ 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. ]
I'd actually have added a comment to dangom's answer, but I lack the reputation required.
I'm a little confused by your example. You're plotting the histogram of the histogram bins if I'm not mistaken. It should rather be this, right?
plt.figure()
plt.plot(a[1][:-1], a[0], marker='.', label='a')
plt.plot(b[1][:-1], b[0], marker='.', label='b')
plt.plot(c[1][:-1], c[0], marker='.', label='c')
plt.legend()
plt.show()
Also a note on your suggestion for combining the histograms: you are of course right that there is no unique solution, since you simply don't know where the samples would have been in the finer grid you use for the combination. When the two histograms have significantly differing bin widths, the suggested merging function may result in a sparse and artificial-looking histogram.
I tried combining the histograms by interpolation (assuming the samples within each bin were distributed uniformly in the original bin, which is of course also only an assumption).
This leads, however, to a more natural-looking result, at least for data sampled from the distributions I typically encounter.
import numpy as np
def merge_hist(a, b):
    edgesa = a[1]
    edgesb = b[1]
    da = edgesa[1]-edgesa[0]
    db = edgesb[1]-edgesb[0]
    dint = np.min([da, db])
    min = np.min(np.hstack([edgesa, edgesb]))
    max = np.max(np.hstack([edgesa, edgesb]))
    edgesc = np.arange(min, max, dint)
    def interpolate_hist(edgesint, edges, hist):
        cumhist = np.hstack([0, np.cumsum(hist)])
        cumhistint = np.interp(edgesint, edges, cumhist)
        histint = np.diff(cumhistint)
        return histint
    histaint = interpolate_hist(edgesc, edgesa, a[0])
    histbint = interpolate_hist(edgesc, edgesb, b[0])
    c = histaint + histbint
    return c, edgesc
An example with two Gaussian distributions:
import numpy as np
import matplotlib.pyplot as plt
a = 5 + 1*np.random.randn(100)
b = 10 + 2*np.random.randn(100)
hista, edgesa = np.histogram(a, bins=10)
histb, edgesb = np.histogram(b, bins=5)
histc, edgesc = merge_hist([hista, edgesa], [histb, edgesb])
plt.figure()
width = edgesa[1]-edgesa[0]
plt.bar(edgesa[:-1], hista, width=width)
width = edgesb[1]-edgesb[0]
plt.bar(edgesb[:-1], histb, width=width)
plt.figure()
width = edgesc[1]-edgesc[0]
plt.bar(edgesc[:-1], histc, width=width)
plt.show()
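One quick sanity check (my own addition) is to compare the total counts before and after merging; note that because np.arange excludes its endpoint, the last partial bin of the merged grid can drop a few counts:
# Total counts in the inputs vs. the interpolated merge.
print(hista.sum() + histb.sum(), histc.sum())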
I, however, am no statistician, so please let me know if the suggested approach is viable.
There is no unique solution to the problem of merging two different histograms. I propose here a simple and quick solution based on two design assumptions, necessary to deal with the loss of information inherent in binning sequences:
1. Recovered values are represented by the start of the bin they belong to.
2. The merge shall keep the highest bin resolution to avoid further loss of information and shall completely encompass the intervals of the child histograms.
Here's the code:
import numpy as np
def merge(a, b):
    def extract_vals(hist):
        # Recover values based on assumption 1.
        values = [[y]*x for x, y in zip(hist[0], hist[1])]
        # Return flattened list.
        return [z for s in values for z in s]
    def extract_bin_resolution(hist):
        return hist[1][1] - hist[1][0]
    def generate_num_bins(minval, maxval, bin_resolution):
        # Generate number of bins necessary to satisfy assumption 2
        return int(np.ceil((maxval - minval) / bin_resolution))
    vals = extract_vals(a) + extract_vals(b)
    bin_resolution = min(map(extract_bin_resolution, [a, b]))
    num_bins = generate_num_bins(min(vals), max(vals), bin_resolution)
    return np.histogram(vals, bins=num_bins)
Here's the example code:
import numpy as np
import matplotlib.pyplot as plt
x = [1,2,2,3]
y = [4,5,5,6]
a = np.histogram(x, bins=10)
# a[0] = [1, 0, 0, 0, 0, 2, 0, 0, 0, 1]
# a[1] = [ 1. , 1.2, 1.4, 1.6, 1.8, 2. , 2.2, 2.4, 2.6, 2.8, 3. ]
b = np.histogram(y, bins=5)
# b[0] = [1, 0, 2, 0, 1]
# b[1] = [ 4. , 4.4, 4.8, 5.2, 5.6, 6. ]
# Merge and plot results
c = merge(a, b)
c_num_bins = c[1].size - 1
plt.hist(a[0], bins=5, label='a')
plt.hist(b[0], bins=10, label='b')
plt.hist(c[0], bins=c_num_bins, label='c')
plt.legend()
plt.show()